Abstract.

We present new and improved approximation and FPT algorithms for computing rooted and unrooted maximum agreement forests (MAFs) of pairs of phylogenetic trees. Their sizes correspond to the subtree-prune-and-regraft distances and the tree-bisection-and-reconnection distances of the trees, respectively. We also provide the first bounded search tree FPT algorithm for computing rooted maximum acyclic agreement forests (MAAFs) of pairs of phylogenetic trees, whose sizes are the hybridization numbers of these pairs of trees. These distance metrics are essential tools for understanding reticulate evolution.



Phylogenetic trees are a standard model to represent the evolutionary relationships among a set of species and are an indispensable tool in evolutionary biology [18]. Early methods of building phylogenetic trees used morphology, or structural characteristics of species, to determine their relatedness. However, advances in molecular biology have allowed the widespread use of DNA and protein sequence data to build phylogenies. Molecular phylogenetics is particularly useful in the study of microscopic organisms due to their high rate of evolution and subtle differences in appearance. However, even good phylogenetic inference methods cannot guarantee that a constructed tree correctly represents evolutionary relationships (and there may not even exist such a tree) because not all groups of species follow a simple tree-like evolutionary pattern. Collectively known as reticulation events, non-tree-like evolutionary processes, such as hybridization, lateral gene transfer (LGT), and recombination, result in species being composites of genes derived from different ancestors. These processes allow species to rapidly acquire useful traits and adapt to new environments. This includes harmful traits of pathogenic bacteria, such as antibiotic resistance, and LGT appears to have contributed to the emergence of pathogens such as Mycobacterium tuberculosis [26].

Due to reticulation events, phylogenetic trees representing the evolutionary history of different 
genes found in the same set of species may differ. To reconcile these differing evolutionary histories, 
one can use phylogenetic distance metrics that determine how well the evolutionary hypotheses of 
two or more phylogenetic trees agree and often allow us to discover reticulation events that explain 
the differences between the trees. To simultaneously represent these discordant topologies, one can 
use a hybridization network, which is a generalization of a phylogenetic tree that allows species to 
inherit from more than one parent. 



Supported by an NSERC PGS-D graduate scholarship.

* Canada Research Chair; supported in part by NSERC and Genome Atlantic. 

* Canada Research Chair; supported in part by NSERC. 






A number of metrics are commonly used to define the distance between phylogenies. The
Robinson-Foulds distance [21] is popular, as it can be calculated in linear time [12]. Other
metrics, such as the tree bisection and reconnection (TBR) and subtree prune-and-regraft (SPR)
distances [18] and the hybridization number [2], are more biologically meaningful but also NP-hard
to compute [HIS"lllCH[TT]. The SPR distance is equivalent to the minimum number of lateral gene 
transfers required to transform one tree into the other [2, 3] and thus provides a lower bound on the
number of reticulation events needed to reconcile the two phylogenies. Although the TBR distance 
has no known direct biological meaning, TBR operations are often used to explore the space of 
phylogenetic trees. The hybridization number is the minimum number of edges that must be added 
to one tree to transform it into a hybridization network of both trees and thus provides a lower 
bound on the number of hybridization events in an evolutionary history consistent with both trees. 

The minimum number of reticulation events required to reconcile two trees provides the simplest 
explanation for the difference between the trees. For this reason, these metrics have been regularly 
used to model reticulate evolution [2Tj[23] , and the development of efficient algorithms to compute 
the distance between two trees under these metrics has been the focus of much research (see 
Section 1.1). The close relationship between SPR operations and reticulation events has also led
to advances in network models of evolution [2|fTU|[23]. 

In this paper, we present the currently fastest approximation and FPT algorithms for the 
SPR distance, TBR distance, and hybridization number of two phylogenies. Similarly to previous 
algorithms for these problems, we model these distance metrics using rooted and unrooted maximum 
agreement forests (MAFs) for SPR and TBR, and maximum acyclic agreement forests (MAAFs) 
for the hybridization number. An agreement forest of two phylogenies has the property that it 
can be obtained from either tree by cutting an appropriate set of edges. As such, it captures 
the evolutionary relationships that are consistent between both trees. Given an agreement forest 
obtained by removing k edges from each tree, a set of k SPR operations (TBR operations in 
the unrooted case) that transform one tree into the other can be recovered easily. A maximum 
agreement forest is an agreement forest obtained by removing the minimum possible number of 
edges. The corresponding set of operations represents a possible minimum set of reticulation events 
that reconcile the two trees. Similarly, given an acyclic agreement forest of two trees (a restriction 
of an agreement forest that disallows the donation of genetic information from descendant nodes 
to ancestor nodes), a hybrid network with that many hybridization events can be constructed 
quickly [10].

1.1 Related Work 

While SPR distance and the hybridization number (and potentially TBR distance) capture bio- 
logically meaningful notions of similarity between phylogenies, their practical use has been limited 
by the fact that they are NP-hard to compute [HI5]I10|H7]. There are several standard approaches 
for dealing with NP-hard optimization problems that have been employed to compare phylogenies 
using these distance measures. 

Approximation algorithms. Hein et al. [15] claimed a 3-approximation algorithm for computing SPR distances and introduced the notion of a maximum agreement forest (MAF) as the main tool underlying both the approximation algorithm and a proposed NP-hardness proof for computing SPR distances. The central claim was that the number of components in an MAF of two phylogenies is one more than the minimum number of SPR operations needed to transform one into the other. Unfortunately, there were subtle mistakes in the proofs. Allen and Steel [1] proved that the number of components in an MAF is in fact one more than the TBR distance between the two trees. Rodrigues et al. [25] provided instances where the algorithm of [15] provides an approximation guarantee no better than 4 for the size of an MAF, thereby disproving the 3-approximation claim of [15]. They also proposed a modification to the algorithm, which they claimed to produce a 3-approximation for TBR. A counterexample to this claim was provided by Bonet et al. [5], who showed, however, that both the algorithms of [15] and [25] compute 5-approximations of the SPR distance between two rooted phylogenies, and that the algorithms can be implemented in linear time. The approximation ratio was improved to 3 by Bordewich et al. [7], but at the expense of an increased running time of $O(n^5)$.¹ A second 3-approximation algorithm presented in [25] achieves a running time of $O(n^2)$. Using entirely different ideas, Chataigner [11] obtained an 8-approximation algorithm for TBR distances of two or more trees. There is currently no approximation algorithm for the hybridization number of two rooted phylogenies.

Fixed-parameter algorithms. Fixed-parameter algorithms have a running time that is expo- 
nential in some parameter that is specific to the problem but independent of the input size. These 
algorithms are more attractive for determining reticulation events, as they provide exact solutions. 
In most biological data sets, we would expect (or hope) that the number of reticulation events (and 
thus the distance) between two trees would also be relatively small. Thus, the distance under a 
given metric is a natural parameter for fixed-parameter algorithms that compute distances between 
phylogenies. 

The previously best fixed-parameter algorithm for rooted SPR distance is due to Bordewich et al. [7] and runs in $O(4^k \cdot k^4 + n^3)$ time, where $k$ is the distance between the two trees. For TBR distance, the previous best result is due to Hallett and McCartin [14], who provided an algorithm with running time $O(4^k \cdot k^5 + p(n))$, where $p(\cdot)$ is a polynomial function. An earlier algorithm for this problem by Allen and Steel [1] had running time $O(k^{3k} + p(n))$. For unrooted SPR, Hickey et al. [16] first claimed a fixed-parameter algorithm, but the correctness proof was flawed. Recently, St. John [27] proposed a correction of the central technical lemma in Hickey et al.'s result. In [9], Bordewich and Semple provided a fixed-parameter algorithm for the hybridization number of two rooted phylogenies with running time $O((28k)^k + n^3)$. Linz and Semple [19] extended these results to non-binary rooted phylogenies.

Heuristics. We distinguish two types of heuristic approaches to solving NP-hard optimization problems. Heuristics of the first type are similar to approximation algorithms in that they provide approximate solutions efficiently, but they do not provide a guaranteed approximation ratio. Heuristics of the second type provide exact solutions with no guaranteed running time bound.

LatTrans by Hallett and Lagergren [13] models lateral gene transfer events by a restricted version of rooted SPR operations, considering two ways in which the trees can differ. It computes the exact distance under this restricted metric in $O(2^k n^2)$ time. HorizStory by Macleod et al. [20] supports multifurcating trees but does not consider SPR operations where the pruned subtree contains more than one leaf. EEEP by Beiko and Hamilton [3] performs a breadth-first SPR search on a rooted start tree but performs unrooted comparisons between the explored trees and an unrooted reference tree. The distance returned is not guaranteed to be exact due to optimizations and heuristics that limit the scope of the search, although EEEP provides options to compute the exact unrooted SPR distance with no non-trivial bound on the running time. More recently, RiataHGT by Nakhleh et al. [22] calculates an approximation of the SPR distance between rooted multifurcating trees in polynomial time.

¹ Using non-trivial but standard data structures, the running time can be reduced to $O(n^4)$.

                 Approximation                      FPT

TBR
  Previous:      8-approx in polyn. time [11]       $O(4^k k^5 + p(n))$ time [14]
  New:           3-approx in linear time            $O(4^k k + n^3)$ or $O(4^k n)$ time

Rooted SPR
  Previous:      3-approx in $O(n^2)$ time [25]     $O(4^k k^4 + n^3)$ time [7]
  New:           3-approx in linear time            $O(2.42^k k + n^3)$ or $O(2.42^k n)$ time

Hybridization
  Previous:      -                                  $O((28k)^k + n^3)$ time [9]
  New:           -                                  $O(3.18^k k + n^3)$ or $O(3.18^k n)$ time


Table 1: Previous and new results on rooted SPR distance, TBR distance, and hybridization 
number. 



Reductions. Two algorithms for computing rooted SPR distances, Sprdist [31] and TreeSAT [3], 
express the problem of computing maximum agreement forests as an integer linear program (ILP) 
and a satisfiability problem (SAT), respectively, and employ efficient ILP and SAT solvers to 
obtain a solution. Sprdist has been shown to outperform EEEP and LatTrans [31]. Although such
algorithms draw on the close scrutiny that has been applied to these problems, the conversion 
process may throw away information that can be exploited, such as a fixed parameter. 

1.2 Contribution 

Our contribution is to develop more efficient methods to compute biologically meaningful distances 
between phylogenetic trees. We provide a unifying view on the approximation and FPT results 
discussed in the previous section and substantially improve on them along the way. In particular, 
using a "shifting lemma" proved by Bordewich et al. [7], we show that the framework of the
algorithms of [5, 15, 25] can be used not only to approximate the SPR distance between two rooted
phylogenies, but also to obtain approximation and FPT algorithms for rooted SPR distance and
(unrooted) TBR distance. We then analyze the structure of rooted agreement forests further and
identify three distinct subcases that allow us to achieve a greatly improved FPT algorithm. Table 1
shows our new results in comparison to the best previous results. The 3-approximation algorithm
for rooted SPR distance is the algorithm of Rodrigues et al. [25], with modifications to reduce
its running time to linear and to compute an MAF in addition to its size. We believe that the
correctness proof obtained using our approach is simpler than the one presented in [25]. Preliminary
versions of these results were presented in [28, 29].

In [28, 29] we also claimed results on computing MAAFs, but we used an incorrect definition of an acyclic agreement forest that considers only cycles of length 2. The algorithm consisted of two phases. First we produce an agreement forest that is guaranteed to be a supergraph of an MAAF. Then we cut additional edges to eliminate cycles. The first phase is not affected by our incorrect definition of cycles. To implement the second phase correctly, we present a novel method in this paper whose performance is close to the one claimed in [28]. Obtaining this solution required substantial new insights into the structure of acyclic agreement forests beyond the results already published in [28, 29] and previous work. Our algorithm is the first bounded search tree algorithm for computing hybridization numbers and substantially outperforms existing methods.

The rest of this paper is organized as follows. Section 2 introduces the necessary terminology and notation. Section 3 explores the structure of agreement forests and develops key technical lemmas. Section 4 presents our FPT and approximation algorithms for rooted and unrooted MAFs. In Section 5 we present our MAAF FPT algorithm. This section consists of 5 parts, each of which presents one key tool. We first develop a refined cycle graph, analyze cycles in agreement forests, and identify subsets of edges that can be removed from a cyclic agreement forest to give an MAAF. These methods together provide a simple refinement algorithm. We then analyze the tree space explored by our bounded search tree algorithm to halve the exponential base in the running time of the refinement algorithm. We conclude this section with an improved analysis, which shows that only slight modifications to the refinement procedure of the MAAF FPT algorithm lead to a greatly improved running time, the one shown in Table 1. Finally, Section 6 presents concluding remarks and suggests future work.



2 Preliminaries 



Throughout this paper, we mostly use the definitions and notation from [Hl5llTll8ll25] • An (unrooted 
binary phylogenetic) $X$-tree is a tree $T$ whose leaves are the elements of a label set $X$ and whose internal nodes each have degree three. For a subset $V$ of $X$, $T(V)$ is the smallest subtree of $T$ that connects all nodes in $V$. The $V$-tree induced by $T$ is the smallest tree $T|V$ that can be obtained from $T(V)$ using forced contractions; a forced contraction replaces a vertex of degree two and its incident edges with a single edge between its neighbours. These concepts are illustrated in Figures 1(a)-1(c).

A rooted $X$-tree is obtained from an unrooted one, $T$, by subdividing one of $T$'s edges, attaching the node this introduces to a new leaf $\rho$, declaring $\rho$ to be the root and defining parent-child and ancestor-descendant relations accordingly; see Figure 1(e). Throughout this paper, we consider $\rho$ to be a member of $X$. For a subset $V$ of $X$, $T(V)$ and $T|V$ are defined as in the unrooted case, but the construction of $T|V$ from $T(V)$ excludes the root of $T(V)$ from forced contractions. See Figures 1(f) and 1(g).
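
As a rough illustration of the rooted case, the following Python sketch computes the induced V-tree (the tree written T|V above) for a tree stored as nested tuples. The tuple encoding and the name `restrict` are assumptions made for this example only and are not taken from the paper; the root label is not modelled separately and would simply be one more leaf label.

```python
def restrict(tree, labels):
    """Return the tree induced by `labels`, or None if no such leaf lies below.

    `tree` is either a leaf label (a string) or a tuple of child subtrees.
    An internal node that keeps exactly one child is suppressed, which plays
    the role of a forced contraction; the top node of the result keeps both
    of its children and is therefore never contracted.
    """
    if not isinstance(tree, tuple):                 # a leaf
        return tree if tree in labels else None
    kept = [sub for sub in (restrict(child, labels) for child in tree)
            if sub is not None]
    if not kept:                                    # nothing of V below this node
        return None
    if len(kept) == 1:                              # degree-two vertex: contract it
        return kept[0]
    return tuple(kept)


# Example: a five-leaf rooted tree restricted to V = {1, 2, 4}.
example = ((("1", "2"), "3"), ("4", "5"))
induced = restrict(example, {"1", "2", "4"})        # (("1", "2"), "4")
```

The recursion visits every node once, so for a tree with n leaves the sketch does O(n) set lookups; it makes no attempt to canonicalize the order of children.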



Given an unrooted $X$-tree $T$, a tree bisection and reconnection (TBR) operation cuts an edge $xy$, thereby dividing $T$ into two subtrees $T_x$ and $T_y$ containing $x$ and $y$, respectively. Then it introduces two new vertices $x'$ and $y'$ into $T_x$ and $T_y$ by subdividing one edge of each tree, and adds an edge $x'y'$ to reconnect the two trees. Finally, $x$ and $y$ are removed using forced contractions. This is illustrated in Figure 1(d).



A subtree prune and regraft (SPR) operation on a rooted $X$-tree $T$ cuts an edge $e_x := x p_x$, where $p_x$ denotes the parent of $x$. This divides $T$ into subtrees $T_x$ and $T_{p_x}$ containing $x$ and $p_x$, respectively. Then it introduces a node $p'_x$ into $T_{p_x}$ by subdividing an edge of $T_{p_x}$ and adds an edge $x p'_x$, making $x$ a child of $p'_x$. Finally, $p_x$ is removed using a forced contraction. See Figure 1(h).

TBR and SPR operations give rise to distance measures $d_{TBR}(\cdot, \cdot)$ and $d_{SPR}(\cdot, \cdot)$ between $X$-trees, defined as the minimum number of such operations required to transform one tree into the other. The trees in Figure 3(a), for example, have SPR distance $d_{SPR}(T_1, T_2) = 3$.

A related distance measure for rooted $X$-trees is their hybridization number, $\mathrm{hyb}(T_1, T_2)$, which is defined in terms of hybrid networks of the two trees. A hybrid network of $T_1$ and $T_2$ is a directed acyclic graph $H$ such that both $T_1$ and $T_2$, with their edges directed away from the root, can be obtained from $H$ by deleting edges and performing forced contractions. For a vertex $x \in H$, let $\deg_{in}(x)$ be its in-degree and $\deg^-_{in}(x) = \max(0, \deg_{in}(x) - 1)$. Then the hybridization number of $T_1$ and $T_2$ is $\min_H \sum_{x \in H} \deg^-_{in}(x)$, where the minimum is taken over all hybrid networks $H$ of $T_1$ and $T_2$. This is illustrated in Figure 3(c).
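
To make the quantity inside the minimum concrete, the sketch below evaluates the sum of max(0, in-degree - 1) for one given network. The edge-list encoding and the name `reticulation_count` are assumptions for illustration; the sketch scores a single network and does not search over all hybrid networks, so it yields only an upper bound on the hybridization number when the network displays both trees.

```python
from collections import Counter


def reticulation_count(edges):
    """Sum of max(0, in-degree - 1) over all nodes of a directed graph.

    `edges` is an iterable of (parent, child) pairs. Nodes that never occur
    as a child have in-degree 0 and contribute nothing to the sum.
    """
    in_degree = Counter(child for _, child in edges)
    return sum(max(0, d - 1) for d in in_degree.values())


# A toy network in which node "h" has the two parents "u" and "v".
toy_network = [
    ("r", "u"), ("r", "v"),
    ("u", "a"), ("u", "h"),
    ("v", "h"), ("v", "b"),
    ("h", "c"),
]
extra_parents = reticulation_count(toy_network)     # 1
```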



These metrics are related to the sizes of appropriately defined agreement forests. To define these, we first introduce some terminology. For a forest $F$ whose components $T_1, T_2, \ldots, T_k$ have label sets $X_1, X_2, \ldots, X_k$, we say $F$ yields the forest with components $T_1|X_1, T_2|X_2, \ldots, T_k|X_k$; if $X_i = \emptyset$, then $T_i(X_i) = \emptyset$ and, hence, $T_i|X_i = \emptyset$. For a subset $E$ of edges of $F$, we use $F - E$ to denote the forest obtained by deleting the edges in $E$ from $F$, and $F \div E$ to denote the forest

Figure 1: (a) An unrooted $X$-tree $T$. (b) The subtree $T(V)$ for $V = \{1, 2, 4\}$. (c) $T|V$. (d) Illustration of a TBR operation. (e) The rooted $X$-tree obtained by subdividing the edge adjacent to 1 in the tree in Figure 1(a). (f) The subtree $T(V)$ for $V = \{1, 2, 4\}$. (g) $T|V$. (h) Illustration of an SPR operation.

yielded by $F - E$. Thus, $F \div E$ is the contracted form of $F - E$. We say $F \div E$ is a forest of $F$.

Given $X$-trees $T_1$ and $T_2$ and forests $F_1$ of $T_1$ and $F_2$ of $T_2$, a forest $F$ is an agreement forest (AF) of $F_1$ and $F_2$ if it is a forest of both $F_1$ and $F_2$. $F$ is a maximum agreement forest (MAF) of $F_1$ and $F_2$ if there is no AF of $F_1$ and $F_2$ with fewer components. We denote the number of components in an MAF of $F_1$ and $F_2$ by $m(F_1, F_2)$, and the size of the smallest edge set $E$ such that $F \div E$ is an AF of $F_1$ and $F_2$ by $e(F_1, F_2, F)$, where $F$ is a forest of $F_2$. Bordewich and Semple [8] showed that, for two rooted $X$-trees $T_1$ and $T_2$, $d_{SPR}(T_1, T_2) = e(T_1, T_2, T_2) = m(T_1, T_2) - 1$. Allen and Steel [1] showed that, for two unrooted $X$-trees $T_1$ and $T_2$, $d_{TBR}(T_1, T_2) = e(T_1, T_2, T_2) = m(T_1, T_2) - 1$. An MAF of the trees from Figure 3(a) is shown in Figure 3(b).

Hybridization numbers correspond to MAFs with an additional constraint. For two forests $F_1$ and $F_2$ of $T_1$ and $T_2$ and an AF $F = \{C_\rho, C_1, C_2, \ldots, C_k\}$ of $F_1$ and $F_2$, we define a cycle graph $G_F$ of $F$. Each node of $G_F$ represents a component of $F$, and there is an edge from node $C_i$ to node $C_j$ if $C_i$ is an ancestor of $C_j$ in one of the trees. Formally, we map every node $x \in F$ to two nodes $\phi_1(x) \in T_1$ and $\phi_2(x) \in T_2$ by defining $\phi_i(x)$ to be the lowest common ancestor in $T_i$ of all labelled leaves that are descendants of $x$ in $F$. We refer to $\phi_1(x)$ and $\phi_2(x)$ simply as $x$ in this paper, except when this creates confusion. For two components $C_i$ and $C_j$ of $F$ with roots $r_i$ and $r_j$, $G_F$ contains the edge $(C_i, C_j)$ if and only if either $\phi_1(r_i)$ is an ancestor of $\phi_1(r_j)$ or $\phi_2(r_i)$ is an ancestor of $\phi_2(r_j)$. We say $F$ is cyclic if $G_F$ contains a directed cycle. Otherwise $F$ is an acyclic agreement forest (AAF) of $F_1$ and $F_2$. A maximum acyclic agreement forest (MAAF) of $F_1$ and $F_2$ is an AAF with the minimum number of components among all AAFs of $F_1$ and $F_2$. We denote its size by $\tilde{m}(F_1, F_2)$ and the number of edges in a forest $F$ of $F_2$ that must be cut to obtain an AAF of $F_1$ and $F_2$ by $\tilde{e}(F_1, F_2, F)$. Baroni et al. [2] showed that $\mathrm{hyb}(T_1, T_2) = \tilde{e}(T_1, T_2, T_2) = \tilde{m}(T_1, T_2) - 1$.

An MAAF of the trees from Figure 3(a) is shown in Figure 3(d). The cycle graphs for the MAF and MAAF of these trees are shown in Figures 3(e) and 3(f), respectively.
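
The cycle graph definition can be sketched in a few lines of Python. Everything about the encoding here is an assumption for illustration: trees are nested tuples whose leaves are strings (with the root label stored as an ordinary leaf "rho"), components are given by their leaf sets, and "ancestor" is read as proper ancestor. The lowest common ancestor of a component is identified with the smallest cluster (leaf set of a node) containing the component, and acyclicity is tested with Kahn's topological sort.

```python
def clusters(tree):
    """Leaf sets of all nodes of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, tuple):
            leaves = frozenset().union(*(walk(child) for child in node))
        else:
            leaves = frozenset([node])
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def lca_cluster(tree_clusters, labels):
    """Cluster of the lowest common ancestor of `labels`."""
    return min((c for c in tree_clusters if labels <= c), key=len)


def cycle_graph(t1, t2, components):
    """Edges (i, j): the root of component i is a proper ancestor of the
    root of component j in t1 or in t2."""
    comps = [frozenset(c) for c in components]
    edges = set()
    for tree_clusters in (clusters(t1), clusters(t2)):
        roots = [lca_cluster(tree_clusters, c) for c in comps]
        for i, ri in enumerate(roots):
            for j, rj in enumerate(roots):
                if rj < ri:                 # proper subset = proper descendant
                    edges.add((i, j))
    return edges


def is_acyclic(n, edges):
    """Kahn's algorithm on nodes 0..n-1."""
    in_degree = [0] * n
    for _, j in edges:
        in_degree[j] += 1
    ready = [i for i in range(n) if in_degree[i] == 0]
    removed = 0
    while ready:
        i = ready.pop()
        removed += 1
        for a, b in edges:
            if a == i:
                in_degree[b] -= 1
                if in_degree[b] == 0:
                    ready.append(b)
    return removed == n


# Hypothetical pair of trees in which {a, b} sits above {c, d} in one tree
# and below it in the other, so the forest {rho}, {a, b}, {c, d} is cyclic.
tree1 = ("rho", (("a", ("c", "d")), "b"))
tree2 = ("rho", (("c", ("a", "b")), "d"))
cyclic_edges = cycle_graph(tree1, tree2, [{"rho"}, {"a", "b"}, {"c", "d"}])
```

The sketch does not check that the given components actually form an agreement forest; it only builds the graph and tests it for directed cycles.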

We write $a \sim_F b$ when there exists a path between two nodes $a$ and $b$ of a forest $F$. For a node $x$ of a rooted forest $F$, $F^x$ denotes the subtree of $F$ induced by all descendants of $x$, inclusive. For two rooted forests $F_1$ and $F_2$ and nodes $a \in F_1$ and $a' \in F_2$, we say that $a$ and $a'$ agree if $F_1^a = F_2^{a'}$. For simplicity, we refer to both $a$ and $a'$ as $a$ and say that $a$ exists in both forests. For forests $F_1$ and $F_2$ and nodes $a, c \in F_1$ with a common parent, we say $(a, c)$ is a sibling pair of $F_1$ if $a$ and $c$ exist in $F_2$. Figure 2 shows such a sibling pair. The definition of a sibling pair can be extended also to unrooted forests, but it is slightly more technical: For a node $x \in F$ and an edge $e$ incident to $x$, let $F^{x,e}$ be the component of $F - \{e\}$ that contains $x$. For a node $x \in F_1$ and an edge $e$ incident to $x$, we say $F_1^{x,e}$ exists in $F_2$ if either there exists a component $C$ of $F_2$ such that $F_1^{x,e}$ yields $C$




Figure 2: A sibling pair $(a, c)$ of two forests $F_1$ and $F_2$: $a$ and $c$ have a common parent in $F_1$, and both subtrees $F_1^a$ and $F_1^c$ exist also in $F_2$.
Figure 3: (a) SPR operations transforming $T_1$ into $T_2$. Each operation changes the top endpoint of one of the dotted edges. (b) The corresponding agreement forest, which can be obtained by cutting the dotted edges in both trees. This is an MAF with 4 components, so $m(T_1, T_2) = 4$ and $e(T_1, T_2, T_2) = d_{SPR}(T_1, T_2) = 3$. (c) A hybrid network of $T_1$ and $T_2$. This network has 4 nodes with an extra parent, so the hybridization number is 4. (d) An MAAF of $T_1$ and $T_2$: $\tilde{e}(T_1, T_2, T_2) = \mathrm{hyb}(T_1, T_2) = 4$. (e) The cycle graph of the agreement forest in (b), which in this case contains a cycle. (f) The cycle graph of the acyclic agreement forest in (d), which contains no cycle.
Figure 4: Illustration of the Shifting Lemma. (a) The lemma applies because $e$ and $f$ are on the boundary of an "empty" component of $F - (E \cup \{e\})$, shown in grey. (b) The lemma does not apply because the component with $e$ and $f$ on its boundary contains a labelled leaf $y$: $v_f \sim_{F - (E \cup \{e\})} y$.

or there exists a node $x'$ in $F_2$ and an edge $e'$ incident to $x'$ such that $F_1^{x,e} = F_2^{x',e'}$. We say the node $x$ exists in $F_2$ if the tree $F_1^{x,e}$ exists in $F_2$, for some edge $e$ incident to $x$. For a pair of nodes $a, c \in F_1$ with a common neighbour, let $e_a$ and $e_c$ be the edges connecting $a$ and $c$ to their common neighbour. Then $(a, c)$ is a sibling pair if the trees $F_1^{a,e_a}$ and $F_1^{c,e_c}$ exist in $F_2$. In this case, we use $F_2^a$ to denote the component of $F_2$ yielded by $F_1^{a,e_a}$ or the subtree $F_2^{x',e'}$ equal to $F_1^{a,e_a}$, whichever case applies; $F_2^c$ is defined analogously.

The correctness proofs of our algorithms in the next sections make use of the following two lemmas. Lemma 1 was shown by Bordewich et al. [7] and is illustrated in Figure 4. Suppose we cut a set of edges $E$ from a forest $F$ to yield $F \div E$, and there is an edge $e$ of $F$ such that $F - (E \cup \{e\})$ has a component without labelled nodes. This lemma shows that the forest $F \div (E \setminus \{f\} \cup \{e\})$ obtained by replacing any edge $f \in E$ on the boundary of this "empty" component with $e$ is the same as $F \div E$.

Lemma 1 (Shifting Lemma). Let $F$ be a forest of an $X$-tree, $e$ and $f$ edges of $F$, and $E$ a subset of edges of $F$ such that $f \in E$ and $e \notin E$. Let $v_f$ be the end vertex of $f$ closest to $e$, and $v_e$ an end vertex of $e$. If $v_f \sim_{F - E} v_e$ and $x \nsim_{F - (E \cup \{e\})} v_f$, for all $x \in X$, then $F \div E = F \div (E \setminus \{f\} \cup \{e\})$.

Let $F_1$ and $F_2$ be forests of $X$-trees $T_1$ and $T_2$, respectively. Any agreement forest of $F_1$ and $F_2$ is clearly also an agreement forest of $T_1$ and $T_2$. Conversely, an agreement forest of $T_1$ and $T_2$ is an agreement forest of $F_1$ and $F_2$ if it is a forest of $F_2$ and there are no two leaves $a$ and $b$ such that $a \sim_{F_2} b$ but $a \nsim_{F_1} b$. This is formalized in the following lemma. Our algorithms ensure that any intermediate forests $F_1$ and $F_2$ they produce have this latter property. Thus, this lemma allows us to reason about agreement forests of $F_1$ and $F_2$ and of $T_1$ and $T_2$ interchangeably, as long as they are forests of $F_2$.

Lemma 2. Let $F_1$ and $F_2$ be forests of $X$-trees $T_1$ and $T_2$, respectively. Let $F_1$ be the union of trees $\dot{T}_1, \dot{T}_2, \ldots, \dot{T}_k$ and $F_2$ be the union of forests $\dot{F}_1, \dot{F}_2, \ldots, \dot{F}_k$ such that $\dot{T}_i$ and $\dot{F}_i$ have the same label set, for all $1 \le i \le k$. $F_2 \div E$ is an AF of $T_1$ and $T_2$ if and only if it is an AF of $F_1$ and $F_2$.

A triple $ab|c$ of a rooted forest $F$ is defined by a set $\{a, b, c\}$ of three leaves in the same component of $F$ and such that the path from $a$ to $b$ in $F$ is disjoint from the path from $c$ to the root of the component. A triple of a forest $F_1$ is compatible with a forest $F_2$ if it is also a triple of $F_2$; otherwise it is incompatible with $F_2$. A quartet $ab|cd$ of an unrooted forest $F$ is defined by a set $\{a, b, c, d\}$ of four leaves in the same component of $F$ and such that the path from $a$ to $b$ in $F$ is disjoint from the path from $c$ to $d$. A quartet of a forest $F_1$ is compatible with a forest $F_2$ if it is also a quartet of $F_2$; otherwise it is incompatible with $F_2$. An agreement forest of two forests $F_1$ and $F_2$ cannot contain a triple or quartet incompatible with either of the two forests. Thus, we have the following observation.

Observation 1. (i) Let $F_1$ and $F_2$ be forests of rooted $X$-trees $T_1$ and $T_2$, and let $F$ be an agreement forest of $F_1$ and $F_2$. If $ab|c$ is a triple of $F_1$ incompatible with $F_2$, then $a \nsim_F b$ or $a \nsim_F c$.

(ii) Let $F_1$ and $F_2$ be forests of unrooted $X$-trees $T_1$ and $T_2$, and let $F$ be an agreement forest of $F_1$ and $F_2$. If $ab|cd$ is a quartet of $F_1$ incompatible with $F_2$, then $a \nsim_F b$, $a \nsim_F c$ or $c \nsim_F d$.
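
A small Python sketch may help to see which triples part (i) of the observation talks about. It enumerates the triples of rooted binary trees stored as nested tuples and reports those of one forest that are not triples of another. The encoding (a forest as a list of trees, a triple ab|c as the pair `(frozenset({a, b}), c)`) and the function names are assumptions made for this example.

```python
from itertools import product


def leaves(node):
    """Leaf labels below a node of a nested-tuple tree."""
    if isinstance(node, tuple):
        return frozenset().union(*(leaves(child) for child in node))
    return frozenset([node])


def triples(tree):
    """All triples ab|c of a rooted binary tree.

    For an internal node v, every pair (a, b) with a below one child and b
    below the other has v as its lowest common ancestor, so ab|c is a triple
    for each leaf c of the tree that is not below v.
    """
    everything = leaves(tree)
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return
        left, right = node                  # assumes binary internal nodes
        outside = everything - leaves(node)
        for a, b in product(leaves(left), leaves(right)):
            for c in outside:
                found.add((frozenset((a, b)), c))
        walk(left)
        walk(right)

    walk(tree)
    return found


def incompatible_triples(forest1, forest2):
    """Triples of forest1 that are not triples of forest2."""
    of_forest2 = set().union(*(triples(t) for t in forest2))
    of_forest1 = {t for component in forest1 for t in triples(component)}
    return of_forest1 - of_forest2


# ac|b is a triple of the first tree but not of the second.
first = (("a", "c"), "b")
second = (("a", "b"), "c")
conflicts = incompatible_triples([first], [second])
```

For the example, the only conflict is ac|b, so by part (i) any agreement forest of the two trees must separate a from c or a from b. The enumeration is cubic in the number of leaves and is meant for small illustrations only.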

For two forests $F_1$ and $F_2$ with the same label set, two components $C_1$ and $C_2$ of $F_1$ are said to overlap in $F_2$ if there exist leaves $a, b \in C_1$ and $c, d \in C_2$ such that the paths from $a$ to $b$ and from $c$ to $d$ in $F_2$ exist and are non-disjoint. The following lemma is an easy extension of a lemma of [7], which states the same result for a tree $T_2$ instead of a forest $F_2$.

Lemma 3. Let $F_1$ and $F_2$ be forests of two $X$-trees $T_1$ and $T_2$, and denote the label sets of the components of $F_1$ by $X_1, X_2, \ldots, X_k$ and the label sets of the components of $F_2$ by $Y_1, Y_2, \ldots, Y_l$. $F_2$ is a forest of $F_1$ if and only if (1) for every $Y_j$, there exists an $X_i$ such that $Y_j \subseteq X_i$, (2) no two components of $F_2$ overlap in $F_1$, and (3) no triple of $F_2$ is incompatible with $F_1$.



3 The Structure of Agreement Forests 

This section presents the structural results that provide the intuition and correctness proofs for the algorithms presented in Sections 4 and 5. All these algorithms start with a pair of trees $(T_1, T_2)$ and then cut edges, remove agreeing components from consideration, and merge sibling pairs until the resulting forests are identical. The intermediate state is that $T_1$ has been reduced to a forest $F_1$, and $T_2$ has been reduced to a forest $F_2$. $F_1$ consists of a tree $\dot{T}_1$ and a set of components $F_0$ that exist in $F_2$; $F_2$ consists of a set of components $\dot{F}_2$ that may not agree with $\dot{T}_1$ and the forest $F_0$ that exists in $F_1$. The key part of each iteration is deciding which edges in $\dot{F}_2$ to cut next. The results in this section identify small edge sets in $\dot{F}_2$ such that at least one edge in each of these sets has the property that cutting it reduces $e(T_1, T_2, F_2)$ by one (or $\tilde{e}(T_1, T_2, F_2)$ in the case of the MAAF algorithm). The approximation algorithm cuts all edges in the identified set, and the size of the edge set cut in each step gives the approximation ratio of the algorithm. The FPT algorithm tries each edge in the set in turn, so that the size of the set gives the branching factor for a bounded search tree algorithm.
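
The relationship between the two strategies can be seen on a much simpler stand-in problem. The Python sketch below uses vertex cover, not agreement forests: for any uncovered edge, at least one of its two endpoints belongs to some optimal cover, which plays the same role as the small edge sets described above. Taking both endpoints gives a 2-approximation, while branching on them gives a search tree of depth at most k with branching factor 2. This is only an analogy chosen for brevity; the function names are hypothetical and none of this code is the algorithm of this paper.

```python
def approx_cover(edges):
    """Take every element of each candidate set (here: both endpoints).

    Each chosen pair contains at least one vertex of an optimal cover,
    so the result is at most twice the optimum.
    """
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover |= {u, v}
    return cover


def bounded_search(edges, k):
    """Return a cover with at most k vertices, or None if none exists.

    Branches on the candidates of one uncovered edge; the recursion depth
    is at most k, so at most 2**k leaves of the search tree are explored.
    """
    uncovered = next(iter(edges), None)
    if uncovered is None:
        return set()                        # nothing left to cover
    if k == 0:
        return None                         # budget exhausted
    for x in uncovered:
        remaining = [e for e in edges if x not in e]
        partial = bounded_search(remaining, k - 1)
        if partial is not None:
            return partial | {x}
    return None


# A path a - b - c - d needs two vertices.
path = [("a", "b"), ("b", "c"), ("c", "d")]
exact = bounded_search(path, 2)
rough = approx_cover(path)
```

With candidate sets of size three instead of two, the same pattern would give a 3-approximation and a search tree with at most 3^k leaves, which is the shape of the trade-off described in the paragraph above.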

Now let $(a, c)$ be a sibling pair of $\dot{T}_1$, and assume $a$ and $c$ do not share a common neighbour in $F_2$, and neither $F_2^a$ nor $F_2^c$ is a component of $F_2$. This implies that $a$ and $c$ are in $\dot{F}_2$ because $\dot{T}_1$ and $\dot{F}_2$ have the same label set. In the rooted case, assume further that, if $a$ and $c$ belong to the same component of $F_2$, $a$'s distance to the root of this component is no less than $c$'s. Now let $p_a$ denote the unique neighbour of $a$ that does not belong to $F_2^a$; in the rooted case, this is $a$'s parent. If $a$ and $c$ belong to the same tree of $F_2$, let $b \ne a$ be the neighbour of $p_a$ that does not belong to the path from $a$ to $c$ in $F_2$; in the rooted case, this is $p_a$'s other child. If $a$ and $c$ belong to different trees, let $b$ be $p_a$'s other child in the rooted case or any neighbour of $p_a$ other than $a$ in the unrooted case. We use $e_a$ and $e_b$ to denote the edges connecting $a$ and $b$ to their common neighbour $p_a$. In the unrooted case, the sibling $d$ of $c$ is defined analogously. These labels are illustrated in Figures 5(a) and 5(b) for rooted and unrooted trees, respectively.

By Lemma 2, our conditions on the structure of $F_1$ and $F_2$ ensure that an agreement forest of $F_1$ and $F_2$ is an agreement forest of $T_1$ and $T_2$ and vice versa. Cutting $e_a$, $e_b$ or $e_c$ maintains this structure. We make free use of these facts in the following proofs.
Figure 5: Tree labels for the case when $a \sim_{F_2} c$. (a) The rooted case. (b) The unrooted case.

3.1 Rooted MAF

Our first set of lemmas characterizes edges that can be cut in order to make progress towards an MAF of two rooted trees. First we show that at least one of the edges $e_a$, $e_b$, and $e_c$ has the property that cutting it reduces $e(T_1, T_2, F_2)$ by one. This immediately implies that cutting $e_a$, $e_b$ and $e_c$ reduces $e(T_1, T_2, F_2)$ by at least 1, which we require for the approximation algorithm. The same holds also for $\tilde{e}(T_1, T_2, F_2)$. The following theorem states this formally.

Theorem 1. Let $T_1$ and $T_2$ be rooted $X$-trees, and let $F_1$ be a forest of $T_1$ and $F_2$ a forest of $T_2$. Suppose $F_1$ consists of a tree $\dot{T}_1$ and a set of components that exist in $F_2$. Let $(a, c)$ be a sibling pair of $\dot{T}_1$ that is not a sibling pair of $F_2$, and assume neither $F_2^a$ nor $F_2^c$ is a component of $F_2$. Then

(i) $e(T_1, T_2, F_2 - \{e_x\}) = e(T_1, T_2, F_2) - 1$, for some $x \in \{a, b, c\}$.

(ii) $e(T_1, T_2, F_2 - \{e_a, e_b, e_c\}) \le e(T_1, T_2, F_2) - 1$.

(iii) $\tilde{e}(T_1, T_2, F_2 - \{e_x\}) = \tilde{e}(T_1, T_2, F_2) - 1$, for some $x \in \{a, b, c\}$.

(iv) $\tilde{e}(T_1, T_2, F_2 - \{e_a, e_b, e_c\}) \le \tilde{e}(T_1, T_2, F_2) - 1$.

Proof. Note that (ii) follows immediately from (i), and (iv) follows from (iii). For (i), it suffices
to prove that there exists an edge set E of size $e(T_1, T_2, F_2)$ such that $F_2 - E$ yields an MAF
of $T_1$ and $T_2$ and $E \cap \{e_a, e_b, e_c\} \ne \emptyset$. So assume $F_2 - E$ yields an MAF $F'$ of $T_1$ and $T_2$ and
$E \cap \{e_a, e_b, e_c\} = \emptyset$. By Lemma 2, $F'$ is also an MAF of $F_1$ and $F_2$. We prove that we can replace
an edge $f \in E$ with an edge in $\{e_a, e_b, e_c\}$ without changing the forest yielded by $F_2 - E$.

First assume $b' \not\sim_{F_2 - E} p_a$, for all leaves $b' \in F_2^b$. In this case, we choose an arbitrary leaf
$b' \in F_2^b$ and the first edge $f \in E$ on the path from $p_b$ to $b'$. Lemma 1 implies that $F_2 - E$ and
$F_2 - (E \setminus \{f\} \cup \{e_b\})$ yield the same forest. If $a' \not\sim_{F_2 - E} p_a$, for all leaves $a' \in F_2^a$, or
$c' \not\sim_{F_2 - E} c$, for all leaves $c' \in F_2^c$, then the same argument shows that $F_2 - E$ and
$F_2 - (E \setminus \{f\} \cup \{e_a\})$ or $F_2 - E$ and $F_2 - (E \setminus \{f\} \cup \{e_c\})$ yield the same forest.

Thus, we can assume there exist leaves $a' \in F_2^a$, $b' \in F_2^b$, and $c' \in F_2^c$ such that
$a' \sim_{F_2 - E} p_a \sim_{F_2 - E} b'$ and $c' \sim_{F_2 - E} c$. Since (a, c) is a sibling pair of $\dot{T}_1$, $a'c'|b'$ is a triple
of $\dot{T}_1$ and, hence, of $F_1$, while $c \notin F_2^b$ implies that either $a'b'|c'$ is a triple of $F_2$ or
$a' \not\sim_{F_2} c'$. In either case, the triple $a'c'|b'$ is incompatible with $F_2$. Thus, Observation 1(i) implies
that $a' \not\sim_{F_2 - E} c'$, because $F_2 - E$ yields an agreement forest of $F_1$ and $F_2$ and $a' \sim_{F_2 - E} b'$.
This, in turn, implies that either $c'' \not\sim_{F_2 - E} x$, for all leaves $c'' \in F_2^c$ and $x \notin F_2^c$, or
$a'' \not\sim_{F_2 - E} x$, for all leaves $a'' \in F_2^a$ and $x \notin F_2^a$, because (a, c) is a sibling pair of $\dot{T}_1$.
Since $a' \sim_{F_2 - E} b'$, the latter cannot be true, that is, we must have $c'' \not\sim_{F_2 - E} x$, for all
$c'' \in F_2^c$ and $x \notin F_2^c$.

Now, as $F_2^c$ is not a component of $F_2$, there exists a leaf $l \notin F_2^c$ such that $c \sim_{F_2} l$. At least
one of the edges on the path from c to l in $F_2$ belongs to E; let f be the one closest to c. Since
$c'' \not\sim_{F_2 - E} x$, for all leaves $c'' \in F_2^c$ and $x \notin F_2^c$, edges $e_c$ and f satisfy the conditions of Lemma 1
and $F_2 - E$ and $F_2 - (E \setminus \{f\} \cup \{e_c\})$ yield the same forest.

The proof of (iii) is identical to the proof of (i) just given. The only difference is that we consider
an edge set E that yields an MAAF instead of an MAF. □

Theorem 1 is all that is needed to obtain a linear-time 3-approximation algorithm and an FPT
algorithm with running time $O(3^k n)$ for rooted MAF. Next we examine the structure of a sibling
pair more closely, and develop a more refined analysis as a basis for a faster FPT algorithm. We
distinguish three cases (see Figure 6):

(Separate Components) $a \not\sim_{F_2} c$,

(One Pendant Node) $a \sim_{F_2} c$ and the path from a to c in $F_2$ has exactly one pendant node b,
or

(Multiple Pendant Nodes) $a \sim_{F_2} c$ and the path from a to c in $F_2$ has $q \ge 2$ pendant nodes
$b_1, b_2, \dots, b_q$.
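
The three cases can be told apart by walking from a and c towards the roots of their components in $F_2$. The following is a minimal sketch of that test, not the authors' implementation: the `Node` class and the helpers `join` and `classify_sibling_pair` are hypothetical names introduced here, and the forest is assumed to be rooted and binary.

```python
class Node:
    """A node of a rooted binary forest; `parent` is None for a root."""
    def __init__(self, label=None):
        self.label = label
        self.parent = None
        self.children = []


def join(*children):
    """Create an unlabelled parent of the given children and return it."""
    p = Node()
    for ch in children:
        ch.parent = p
        p.children.append(ch)
    return p


def classify_sibling_pair(a, c):
    """Return (case, pendants) for two nodes a and c of the same forest.

    case is "separate", "sibling", "one", or "multiple"; pendants lists the
    nodes hanging off the path from a to c (b, or b_1, ..., b_q).
    """
    # Record every node on the path from a up to the root of its component.
    above_a = set()
    v = a
    while v is not None:
        above_a.add(v)
        v = v.parent
    # Walk up from c until that path is met; the meeting point is the LCA.
    lca = c
    while lca is not None and lca not in above_a:
        lca = lca.parent
    if lca is None:
        return "separate", []
    pendants = []
    for start in (a, c):
        x = start
        while x is not lca and x.parent is not lca:
            # The other children of x's parent hang off the path.
            pendants.extend(s for s in x.parent.children if s is not x)
            x = x.parent
    if not pendants:
        return "sibling", []
    return ("one" if len(pendants) == 1 else "multiple"), pendants
```

The "sibling" outcome corresponds to (a, c) already being a sibling pair of $F_2$, which the case analysis above excludes.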

The following three lemmas provide stronger statements than Theorem 1 about the membership
of some edges of $F_2$ in a set E such that $F_2 \div E$ is an AF of $T_1$ and $T_2$. All three lemmas consider
a sibling pair (a, c) of $\dot{T}_1$ as in Theorem 1 and assume neither $F_2^a$ nor $F_2^c$ is a component of $F_2$.

Lemma 4 (Separate Components). If $a \not\sim_{F_2} c$, there exists an edge set E of size $e(T_1, T_2, F_2)$
(resp. $\tilde{e}(T_1, T_2, F_2)$) such that $F_2 \div E$ is an AF (resp. AAF) of $T_1$ and $T_2$ and $E \cap \{e_a, e_c\} \ne \emptyset$.

Proof. Consider an edge set $E'$ of size $e(T_1, T_2, F_2)$ and such that $F_2 \div E'$ is an AF of $F_1$ and $F_2$,
and assume $E'$ contains the maximum number of edges from $\{e_a, e_c\}$ among all sets $E''$ such that
$|E''| = e(T_1, T_2, F_2)$ and $F_2 \div E''$ is an AF of $T_1$ and $T_2$. Then by the same arguments as in the
proof of Theorem 1, $E' \cap \{e_a, e_c\} = \emptyset$ implies that there exist leaves $a' \in F_2^a$ and $c' \in F_2^c$ such
that $a' \sim_{F_2 \div E'} a$ and $c' \sim_{F_2 \div E'} c$. Since (a, c) is a sibling pair of $F_1$ but $a \not\sim_{F_2} c$ and, hence,
$a' \not\sim_{F_2 \div E'} c'$, we must have $a' \not\sim_{F_2 \div E'} x$, for all leaves $x \notin F_2^a$, or $c' \not\sim_{F_2 \div E'} x$, for all
leaves $x \notin F_2^c$. W.l.o.g. assume the former. Since $F_2^a$ is not a component of $F_2$, there exists a leaf
$l \notin F_2^a$ such that $a \sim_{F_2} l$ and, hence, $a' \sim_{F_2} l$. For each such leaf l, the path from $a'$ to l in $F_2$
contains an edge in $E'$ because $a' \not\sim_{F_2 \div E'} l$, and this edge does not belong to $F_2^a$ because
$a' \sim_{F_2 \div E'} a$. We pick an arbitrary such leaf l, and let f be the edge in $E'$ on the path from $a'$ to l
closest to a. The edges $e_a$ and f satisfy the conditions of Lemma 1 and
$F_2 \div E' = F_2 \div (E' \setminus \{f\} \cup \{e_a\})$, that is, the set $E := E' \setminus \{f\} \cup \{e_a\}$ satisfies the lemma. The
second claim follows using the same arguments after choosing $E'$ of size $\tilde{e}(T_1, T_2, F_2)$ and such that
$F_2 \div E'$ is an AAF of $T_1$ and $T_2$. □

Lemma 5 (One Pendant Node - MAF). If $a \sim_{F_2} c$ and the path from a to c in $F_2$ has only one
pendant node b, there exists an edge set E of size $e(T_1, T_2, F_2)$ such that $F_2 \div E$ is an AF of $T_1$
and $T_2$ and $e_b \in E$.

Proof. Again, consider an edge set $E'$ of size $e(T_1, T_2, F_2)$ and such that $F_2 \div E'$ is an AF of $F_1$
and $F_2$. Assume b is a's sibling in $F_2$ and $E'$ contains the maximum number of edges from $\{e_a, e_b, e_c\}$
among all sets $E''$ such that $|E''| = e(T_1, T_2, F_2)$ and $F_2 \div E''$ is an AF of $F_1$ and $F_2$. By Theorem 1,
$E' \cap \{e_a, e_b, e_c\} \ne \emptyset$. If $e_b \in E'$, there is nothing to prove. So assume $e_b \notin E'$. Let $v = p_a = p_b$, and
$u = p_v = p_c$. We distinguish two cases.

If $|E' \cap \{e_a, e_c, e_v\}| \ge 2$, we define $E := E' \setminus \{e_a, e_c, e_v\} \cup \{e_b, e_u\}$. Clearly, $|E| \le |E'|$. Next
we apply Lemma 3 to prove that $F_2 \div E$ is an AF of $F_1$ and $F_2$. Our conditions that the forest $F_1$
is the union of a tree $\dot{T}_1$ and a forest F, while $F_2$ is the union of the same forest F and another
forest $\dot{F}_2$ with the same label set as $\dot{T}_1$ imply that $F_2$ fulfills the first condition of Lemma 3. Hence,
$F_2 \div E$ also fulfills this condition, and it suffices to prove that no triple of $F_2 \div E$ is incompatible
with $F_1$ and no two components of $F_2 \div E$ overlap in $F_1$.

Since all triples of $F_2 \div E'$ are compatible with $F_1$, it suffices to consider only triples of $F_2 \div E$
that do not exist in $F_2 \div E'$ and prove that they are compatible with $F_1$. However, all such triples
involve only leaves in $F_2^a \cup F_2^c$ because $e_u \in E$. Since (a, c) is a sibling pair of $F_1$, any such triple
is also a triple of $F_1$.

Similarly, since no two components of $F_2 \div E'$ overlap in $F_1$, two components of $F_2 \div E$ can overlap
in $F_1$ only if at least one of them contains a path connecting two leaves in different components
of $F_2 \div E'$. The only such component of $F_2 \div E$ contains only leaves from $F_2^a \cup F_2^c$ (because
$\{e_b, e_u\} \subseteq E$), and any path in this component connecting two leaves in different components of
$F_2 \div E'$ has one endpoint in $F_2^a$ and the other in $F_2^c$. Let $a' \in F_2^a$ and $c' \in F_2^c$ be the endpoints
of such a path $P_2$. The path $P_1$ from $a'$ to $c'$ in $F_1$ stays completely inside $F_1^{p_a}$ and can overlap
only paths with both endpoints in $F_1^{p_a}$, as a path with exactly one endpoint in $F_1^{p_a}$ cannot exist
(the corresponding path in $F_2$ would have exactly one endpoint in $F_2^a \cup F_2^c$, but $\{e_b, e_u\} \subseteq E$). If,
on the other hand, $P_1$ overlapped a path P with both endpoints in $F_1^{p_a}$, the path in $F_2$ connecting
the endpoints of P would stay completely in $F_2^u$ and would overlap $P_2$. This implies that $P_2$ and
P belong to the same component of $F_2 \div E$, that is, no two components of $F_2 \div E$ overlap in $F_1$.

If $|E' \cap \{e_a, e_v, e_c\}| = 1$, then either $e_a \in E'$ or $e_c \in E'$. It is not hard to see that, if $e_a \in E'$, then
$F_2 \div (E' \setminus \{e_a\} \cup \{e_c\})$ is also an AF of $F_1$ and $F_2$, and vice versa. So we can assume w.l.o.g. that
$e_c \in E'$. In this case, we define $E := E' \setminus \{e_c\} \cup \{e_b\}$, which implies that $|E| = |E'|$. To prove that
$F_2 \div E$ is an AF of $F_1$ and $F_2$, we use Lemma 3 again. As in the previous case, since $F_2$ satisfies
the first condition of Lemma 3, so does $F_2 \div E$. Next we prove that there are no triples in $F_2 \div E$
incompatible with $F_1$ nor components of $F_2 \div E$ that overlap in $F_1$.

Again, it suffices to consider only triples of $F_2 \div E$ that are not triples of $F_2 \div E'$. Each such
triple $xy|z$ involves at least one leaf $c' \in F_2^c$ and cannot involve a leaf in $F_2^b$. If all leaves of the
triple belong to $F_2^a \cup F_2^c$, the argument from the previous case shows that the triple is compatible
with $F_1$. Otherwise the triple involves at least one leaf not in $F_2^u$. We choose two leaves $a' \in F_2^a$
and $b' \in F_2^b$ such that $a' \sim_{F_2 \div E'} u \sim_{F_2 \div E'} b'$. By Lemma 1 and because $\{e_a, e_b, e_v\} \cap E' = \emptyset$, such
leaves exist.

If the triple $xy|z$ includes two leaves $c_1, c_2 \in F_2^c$, it is of the form $c_1c_2|z$ with $z \notin F_2^u$. By the
choice of $a'$ and $b'$, the triple $a'b'|z$ is a triple of $F_2 \div E'$ and, hence, of $F_1$. In $F_1$, the lowest common
ancestor of $a'$ and $b'$ is an ancestor of a and c, and c is an ancestor of $c_1$ and $c_2$. Thus, $c_1c_2|z$ is
also a triple of $F_1$.

If the triple $xy|z$ includes exactly one leaf $c' \in F_2^c$, it is of the form $a''c'|z$, $c'y|z$, or $xy|c'$, where
$a'' \in F_2^a$ and $x, y, z \notin F_2^u$. In the first case, we observe that $a''b'|z$ is a triple of $F_2 \div E'$ and, hence,
of $F_1$, and the lowest common ancestor of $a''$ and $b'$ in $F_1$ is an ancestor of a and c. Thus, $a''c'|z$
is also a triple of $F_1$. In the last two cases, $a'y|z$ or $xy|a'$, respectively, is a triple of $F_2$. Since
$E' \setminus E = \{e_c\}$ and $a' \sim_{F_2 \div E'} u$, these must also be triples of $F_2 \div E'$ and, hence, of $F_1$. Now the fact
that (a, c) is a sibling pair of $F_1$ implies that $c'y|z$ or $xy|c'$, respectively, is also a triple of $F_1$.

Finally, for two components of $F_2 \div E$ to overlap in $F_1$, at least one of them has to contain a
path connecting two leaves in different components of $F_2 \div E'$. Since $E' \setminus E = \{e_c\}$, every such
path connects a leaf $c' \in F_2^c$ with a leaf $z \notin F_2^c$. Since $e_a \notin E'$, Lemma 1 and the choice of the set
$E'$ imply that there exists a leaf $a' \in F_2^a$ such that $a' \sim_{F_2 \div E'} z$ and $a' \sim_{F_2 \div E} z$. Now assume x and
y are leaves in a different component of $F_2 \div E$ such that the paths between $c'$ and z and between
x and y overlap in $F_1$, and let f be an edge shared between the two paths. If f belongs to $F_1^{p_a}$,
then the same argument as in the case $|E' \cap \{e_a, e_c, e_v\}| \ge 2$ leads to a contradiction. If f does not
belong to $F_1^{p_a}$, then $z \notin F_1^{p_a}$ and the path from $a'$ to z in $F_1$ includes the edge f. This is again
a contradiction because it implies that the component of $F_2 \div E'$ containing $a'$ and z overlaps the
component containing x and y. □

Lemma 6 (Multiple Pendant Nodes). If $a \sim_{F_2} c$ and the path from a to c in $F_2$ has $q \ge 2$ pendant
nodes $b_1, b_2, \dots, b_q$, there exists an edge set E of size $e(T_1, T_2, F_2)$ (resp. $\tilde{e}(T_1, T_2, F_2)$) such that
$F_2 \div E$ is an AF (resp. AAF) and either $E \cap \{e_a, e_c\} \ne \emptyset$ or $\{e_{b_1}, e_{b_2}, \dots, e_{b_q}\} \subseteq E$.

Proof. We prove the lemma by induction on q. For q = 1, the claim holds by Theorem 1. So assume
q > 1 and the claim holds for q - 1. Assume further that $b_1$ is the sibling of a. By Theorem 1, there
exists a set $E'$ of size $e(T_1, T_2, F_2)$ such that $F_2 \div E'$ is an AF of $F_1$ and $F_2$ and $E' \cap \{e_a, e_{b_1}, e_c\} \ne \emptyset$.
If $E' \cap \{e_a, e_c\} \ne \emptyset$, we are done; otherwise $e_{b_1} \in E'$. Let $F_2' := F_2 \div \{e_{b_1}\}$. In $F_2'$, the path from
a to c has q - 1 pendant nodes, namely $b_2, b_3, \dots, b_q$. Thus, by the induction hypothesis, there
exists a set $E''$ of size $e(T_1, T_2, F_2') = e(T_1, T_2, F_2) - 1$ such that $F_2' \div E''$ is an AF of $F_1$ and $F_2'$
and such that $E'' \cap \{e_a, e_c\} \ne \emptyset$ or $\{e_{b_2}, e_{b_3}, \dots, e_{b_q}\} \subseteq E''$. The set $E := E'' \cup \{e_{b_1}\}$ has size
$|E''| + 1 = e(T_1, T_2, F_2)$, and $F_2 \div E = F_2' \div E''$ is an agreement forest of $F_1$ and $F_2$ (because it
is an agreement forest of $F_1$ and $F_2'$). Since $E'' \cap \{e_a, e_c\} \ne \emptyset$ or $\{e_{b_2}, e_{b_3}, \dots, e_{b_q}\} \subseteq E''$, we have
$E \cap \{e_a, e_c\} \ne \emptyset$ or $\{e_{b_1}, e_{b_2}, \dots, e_{b_q}\} \subseteq E$; that is, E satisfies the lemma.

The proof for AAF is identical to the one for AF, as Theorem 1 holds for both AF and AAF. □

3.2 Unrooted MAF 

We now consider the structure of unrooted MAFs. The next theorem provides an analogous result
to Theorem 1 for unrooted MAFs.

Theorem 2. Let $T_1$ and $T_2$ be two unrooted X-trees, and let $F_1$ be a forest of $T_1$ and $F_2$ a forest
of $T_2$. Suppose $F_1$ consists of a tree $\dot{T}_1$ and a set of components that exist in $F_2$. Let (a, c) be a
sibling pair of $\dot{T}_1$. Finally, suppose a and c do not share a neighbour in $F_2$, and neither $F_2^a$ nor $F_2^c$
is a component of $F_2$. Then $e(T_1, T_2, F_2 - \{e_x\}) = e(T_1, T_2, F_2) - 1$, for some $x \in \{a, b, c, d\}$.

Proof. As in the proof of Theorem 1, our goal is to show that there exists a set E such that $F_2 - E$
yields an MAF of $T_1$ and $T_2$ and $E \cap \{e_a, e_b, e_c, e_d\} \ne \emptyset$. Again, we show that, if $F_2 - E$ yields
an MAF $F'$ of $T_1$ and $T_2$ and $E \cap \{e_a, e_b, e_c, e_d\} = \emptyset$, we can find an edge $f \in E$ and a node
$x \in \{a, b, c, d\}$ such that $F_2 - E$ and $F_2 - (E \setminus \{f\} \cup \{e_x\})$ yield the same forest.

By the same arguments as in the proof of Theorem 1, if there exists no leaf $a' \in F_2^a$ such that
$a' \sim_{F_2 - E} p_a$, we can choose an arbitrary leaf $a'' \in F_2^a$ and replace the first edge in E on the path
from $p_a$ to $a''$ with $e_a$ in E without altering the forest yielded by $F_2 - E$. The same holds if there
exists no leaf $b' \in F_2^b$ such that $b' \sim_{F_2 - E} p_a$, $c' \in F_2^c$ such that $c' \sim_{F_2 - E} p_c$, or $d' \in F_2^d$ such that
$d' \sim_{F_2 - E} p_c$.

So assume there exist leaves $a' \in F_2^a$, $b' \in F_2^b$, $c' \in F_2^c$, and $d' \in F_2^d$ such that
$a' \sim_{F_2 - E} p_a \sim_{F_2 - E} b'$ and $c' \sim_{F_2 - E} p_c \sim_{F_2 - E} d'$. Since (a, c) is a sibling pair of $F_1$, $a'c'|b'd'$ is a
quartet of $F_1$, while $c' \notin F_2^b$ implies that either $a'b'|c'd'$ is a quartet of $F_2$ or $a' \not\sim_{F_2} c'$. In either
case, the quartet $a'c'|b'd'$ is incompatible with $F_2$. Since $F_2 - E$ yields an agreement forest of $T_1$
and $T_2$, it also yields an agreement forest of $F_1$ and $F_2$, by Lemma 2. Hence, as $b' \sim_{F_2 - E} a'$ and
$d' \sim_{F_2 - E} c'$, Observation 1(ii) implies that $a' \not\sim_{F_2 - E} c'$. Since (a, c) is a sibling pair of $F_1$ and
$a' \sim_{F_2 - E} b'$, $a' \not\sim_{F_2 - E} c'$ implies that $c' \not\sim_{F_2 - E} x$, for all $x \notin F_2^c$. This, however, is a contradiction
because $c' \sim_{F_2 - E} d'$. □

Similar to Theorem 1, Theorem 2 immediately implies that $e(T_1, T_2, F_2 - \{e_a, e_b, e_c, e_d\}) \le
e(T_1, T_2, F_2) - 1$, which can be used to obtain a 4-approximation algorithm for TBR distance.
However, we can do a little better.

Theorem 3. Let $T_1$ and $T_2$ be two unrooted X-trees, and let $F_1$ be a forest of $T_1$ and $F_2$ a forest
of $T_2$. Suppose $F_1$ consists of a tree $\dot{T}_1$ and a set of components that exist in $F_2$. Let (a, c) be a
sibling pair of $\dot{T}_1$ that is not a sibling pair of $F_2$, and assume a and c do not share a neighbour in $F_2$,
and neither $F_2^a$ nor $F_2^c$ is a component of $F_2$. Then $e(T_1, T_2, F_2 - \{e_a, e_b, e_c\}) \le e(T_1, T_2, F_2) - 1$.

Proof. Let E be an edge set such that $F_2 - E$ yields an MAF $F'$ of $T_1$ and $T_2$. We can again
assume $E \cap \{e_a, e_b, e_c\} = \emptyset$, as otherwise the theorem holds trivially. Moreover, by Lemma 2, $F'$ is
an agreement forest of $F_1$ and $F_2$.

As in the proof of Theorem 2, if there exists no leaf $a' \in F_2^a$ such that $a' \sim_{F_2 - E} p_a$, then we can
choose an arbitrary leaf $a'' \in F_2^a$ and replace the first edge in E on the path from $p_a$ to $a''$ with $e_a$
without altering the forest yielded by $F_2 - E$. The same holds if there exists no leaf $b' \in F_2^b$ such
that $b' \sim_{F_2 - E} p_a$, or $c' \in F_2^c$ such that $c' \sim_{F_2 - E} c$.

So we can again assume there exist leaves $a' \in F_2^a$, $b' \in F_2^b$, and $c' \in F_2^c$ such that
$a' \sim_{F_2 - E} p_a \sim_{F_2 - E} b'$ and $c' \sim_{F_2 - E} p_c$. Next we show that there exists an edge $f \in E$ such that
$F_2 - (E \cup \{e_a, e_b\})$ and $F_2 - (E \setminus \{f\} \cup \{e_a, e_b, e_c\})$ yield the same forest. This forest is an agreement
forest of $T_1$ and $T_2$, as it can be obtained by cutting edges $e_a$ and $e_b$ in $F'$. Hence,
$e(T_1, T_2, F_2 - \{e_a, e_b, e_c\}) \le |E \setminus \{f\}| = |E| - 1 = e(T_1, T_2, F_2) - 1$. Note that this is not the same
as claiming that we can replace an edge $f \in E$ with an edge in $\{e_a, e_b, e_c\}$ without altering the
resulting forest. It is crucial that all three edges are cut.

We observe that (a, c) being a sibling pair in $F_1$ and $c \notin F_2^b$ imply that $a'c'|b'd'$ is a quartet of $F_1$
incompatible with $F_2$, for all $d'$ in $\dot{T}_1$ but not in $F_2^a \cup F_2^b \cup F_2^c$. If $a' \sim_{F_2 - E} c'$, then $a' \sim_{F_2 - E} b'$ and
Observation 1(ii) imply that $c' \not\sim_{F_2 - E} d'$, for all $d' \notin F_2^a \cup F_2^b \cup F_2^c$. If $a' \not\sim_{F_2 - E} c'$, then (a, c) being
a sibling pair of $F_1$ and $a' \sim_{F_2 - E} b'$ imply that the component of $F_2 - E$ containing $c'$ contains no
leaves not in $F_2^c$. Hence, again $c' \not\sim_{F_2 - E} d'$, for all $d' \notin F_2^a \cup F_2^b \cup F_2^c$. Therefore, $c' \not\sim_{F_2 - E'} x$, for
all $x \notin F_2^c$, where $E' := E \cup \{e_a, e_b\}$. Since the component of $F_2$ containing $c'$ contains at least one
leaf x not in $F_2^c$, this implies that there exists an edge in E on the path from x to $c'$ that is not
in $F_2^c$, and the closest such edge f to $c'$ and edge $e_c$ satisfy the conditions of Lemma 1. Hence,
$F_2 - (E \cup \{e_a, e_b\})$ and $F_2 - (E \setminus \{f\} \cup \{e_a, e_b, e_c\})$ yield the same forest. □



3.3 Rooted MAAF 

Finally, we consider the structure of rooted MAAFs, in order to design a fast FPT algorithm for
computing such forests. We cut edges to obtain an MAAF of $T_1$ and $T_2$ in two phases: As long as
the current forest $F_2$ is not an agreement forest of $T_1$ and $T_2$, we use Lemmas 4 and 6 and Lemma 7
below to make progress towards an MAAF of $T_1$ and $T_2$. Once we obtain an agreement forest F,
this is a superforest of an MAAF of $T_1$ and $T_2$. The second part of our algorithm is an efficient
algorithm to refine F to an MAAF, which we discuss in Section 5.

Lemma 7 states how to handle the case when $a \sim_{F_2} c$ and the path from a to c in $F_2$ has only
one pendant node b. While Theorem 1 and Lemmas 4 and 6 apply to agreement forests and acyclic
agreement forests alike, Lemma 5 states only that $e(T_1, T_2, F_2 \div \{e_b\}) \le e(T_1, T_2, F_2) - 1$ in this
case, not that $\tilde{e}(T_1, T_2, F_2 \div \{e_b\}) \le \tilde{e}(T_1, T_2, F_2) - 1$. This is not just a caveat of the proof: it
is easy to construct examples where $e_b$ is not part of any edge set E such that $F_2 - E$ yields an
MAAF of $T_1$ and $T_2$. However, as Lemma 7 shows, if $e_b$ does not belong to such a set, then $e_c$ does.

Lemma 7 (One Pendant Node - MAAF). If $a \sim_{F_2} c$ and the path from a to c in $F_2$ has only one
pendant node b, there exists an edge set E of size $\tilde{e}(T_1, T_2, F_2)$ and such that $F_2 \div E$ is an AAF of
$T_1$ and $T_2$ and either $e_b \in E$ or $e_c \in E$.

Proof. Let $E'$ be an edge set of size $\tilde{e}(T_1, T_2, F_2)$ and such that $F_2 \div E'$ is an AAF of $T_1$ and $T_2$.
Assume further that there is no such set containing more edges from $\{e_a, e_b, e_c\}$ than $E'$ and that
b is a's sibling in $F_2$. As shown in Theorem 1, $E' \cap \{e_a, e_b, e_c\} \ne \emptyset$. If $e_b \in E'$ or $e_c \in E'$, we are
done. So assume $E' \cap \{e_b, e_c\} = \emptyset$ and, hence, $e_a \in E'$. As in the proof of Lemma 5, let $v = p_a = p_b$
and $u = p_c = p_v$. If $\{e_a, e_v\} \subseteq E'$, Lemma 1 implies that we can replace $e_v$ with $e_b$ in $E'$ without
changing $F_2 \div E'$, contradicting the choice of $E'$. So $e_v \notin E'$. As in the proof of Theorem 1, there
must be leaves $b' \in F_2^b$ and $c' \in F_2^c$ such that $b' \sim_{F_2 \div E'} b$ and $c' \sim_{F_2 \div E'} c$; otherwise Lemma 1 implies
that we can replace some edge $f \in E'$ with $e_c$ or $e_b$ without altering $F_2 \div E'$, which contradicts the
choice of $E'$. Now let $E := E' \setminus \{e_a\} \cup \{e_b\}$. We have $|E| = |E'|$, $e_b \in E$ and, as shown in the proof
of Lemma 5, $F_2 \div E$ is an AF of $T_1$ and $T_2$. Next we show that $F_2 \div E$ is acyclic.

Since $F_2 \div E$ and $F_2 \div E'$ are agreement forests of $T_1$ and $T_2$, the mapping $\phi_1(\cdot)$ maps each node
of these two forests to a corresponding node in $T_1$. However, a node $x \in F_2$ that belongs to both
$F_2 \div E$ and $F_2 \div E'$ may map to different nodes in $T_1$ if it has different sets of labelled descendant
leaves in $F_2 \div E$ and $F_2 \div E'$. For the remainder of this proof, we use $\phi_1(x)$ to denote the node in
$T_1$ a node $x \in F_2$ maps to based on its labelled descendant leaves in $F_2 \div E$, and $\phi_1'(x)$ to denote
the node it maps to based on its labelled descendant leaves in $F_2 \div E'$.

Now assume for the sake of contradiction that $F_2 \div E$ is not acyclic, and let O be a cycle of
$G_{F_2 \div E}$. W.l.o.g. we can assume O is simple, that is, contains every component of $F_2 \div E$ at most
once. Since $F_2 \div E'$ is acyclic, the root r of at least one component in O either is not a root in
$F_2 \div E'$ or satisfies $\phi_1(r) \ne \phi_1'(r)$. The only root in $F_2 \div E$ that does not exist in $F_2 \div E'$ is a
result of cutting edge $e_b$ and is a descendant z of b in $F_2$. The only root in $F_2 \div E'$ that has a
different set of labelled descendant leaves in $F_2 \div E$ is u, if it is a root. For any other component
root x, we have $\phi_1(x) = \phi_1'(x)$. Thus, any such cycle O contains at least one of $C_u$ and $C_z$, which
respectively are the components of $F_2 \div E$ with roots u and z. Next we prove that no such cycle
exists in $G_{F_2 \div E}$, by using the following five observations.

(i) Since u and z belong to the same connected component of $F_2 \div E'$ and z is the only root of
$F_2 \div E$ that does not exist in $F_2 \div E'$, there is no root $x \notin \{u, z\}$ of $F_2 \div E$ on the path from
u to z in $T_2$.

(ii) Since u and z belong to the same connected component of $F_2 \div E'$ and z is a descendant
of u in $T_2$, $\phi_1'(z)$ is a descendant of $\phi_1'(u)$ in $T_1$. Any component $C_x$ with root x such that
$x \notin \{u, z\}$ and $\phi_1(x) = \phi_1'(x)$ belongs to the path from $\phi_1'(u)$ to $\phi_1'(z)$ would overlap the
component containing u in $T_1$. Thus, no such component $C_x$ can exist.

(iii) Since u and $c'$ belong to the same connected component of $F_2 \div E'$, $\phi_1'(u)$ is an ancestor of
$c'$ and, by the same arguments as in (ii), there is no root $x \notin \{u, z\}$ such that $\phi_1'(x) = \phi_1(x)$
belongs to the path from $c'$ to $\phi_1'(u)$.

(iv) Since all labelled descendants of u in $F_2 \div E$ belong to $F_2^a \cup F_2^c$, with at least one labelled
descendant in each of $F_2^a$ and $F_2^c$, $\phi_1(u) = p_a = p_c$. In particular, $c'$ is a descendant of $\phi_1(u)$.
Since u has $c'$ and at least one labelled leaf in $F_2^b$ as descendants in $F_2 \div E'$, $\phi_1'(u)$ is an ancestor
of $\phi_1(u)$.

(v) $\phi_1'(z) = \phi_1(z)$ is neither an ancestor nor a descendant of $\phi_1(u)$. The latter follows because
z has at least one labelled descendant leaf in $F_2 \div E$ that belongs to $F_2^b$, while all labelled
descendant leaves of $\phi_1(u)$ belong to $F_2^a \cup F_2^c$. To see the former, observe that this would imply
that there are two labelled descendant leaves $b_1$ and $b_2$ of z in $F_2 \div E'$ such that $b_1, b_2 \in F_2^b$
and the path from $b_1$ to $b_2$ in $T_1$ includes $\phi_1(z)$. Since $c'$ belongs to the same component as
u in $F_2 \div E'$, and z belongs to the same component as u in $F_2 \div E'$, this would imply that
$F_2 \div E'$ contains the triple $b_1b_2|c'$, while these leaves would form the triple $b_1c'|b_2$ or $b_2c'|b_1$
in $T_1$. This is a contradiction because $F_2 \div E'$ is an agreement forest of $T_1$ and $T_2$.

Using these observations, we show first that O cannot contain $C_z$. Assume the contrary, and
let $y_1$ and $y_2$ be the roots of the components $C_{y_1}$ and $C_{y_2}$ preceding and succeeding $C_z$ in O.

Since u is an ancestor of z in $F_2$, we can have $y_2 = u$ only if $\phi_1(z)$ is an ancestor of $\phi_1(u)$. By
(v), this is impossible. So $y_2 \ne u$.

If $y_1 = u$ (and $y_2 \ne u$), then $\phi_1(z) = \phi_1'(z)$ is an ancestor of $\phi_1(y_2) = \phi_1'(y_2)$ because, by (v),
$\phi_1(u)$ is not an ancestor of $\phi_1(z)$. By (ii), this implies that $\phi_1'(u)$ is an ancestor of $\phi_1'(y_2)$ in $T_1$,
so that we would obtain a cycle in $G_{F_2 \div E'}$ by removing $C_z$ from O, contradicting that $F_2 \div E'$ is
acyclic. This shows that $y_1 \ne u$.

Finally, if $y_1 \ne u$ and $y_2 \ne u$, we distinguish two cases. If $\phi_1'(y_1) = \phi_1(y_1)$ is an ancestor of
$\phi_1(z)$ and $y_2$ is a descendant of z, (i) and (ii) imply that $\phi_1'(y_1)$ is an ancestor of $\phi_1'(u)$ and $y_2$ is a
descendant of u. If $y_1$ is an ancestor of z and $\phi_1(y_2) = \phi_1'(y_2)$ is a descendant of $\phi_1(z)$, (i) and (ii)
imply that $y_1$ is an ancestor of u and $\phi_1(y_2)$ is a descendant of $\phi_1'(u)$. In both cases, we can obtain
a cycle in $G_{F_2 \div E'}$ by replacing $C_z$ with $C_u$ in O, a contradiction. This concludes the proof that O
cannot contain $C_z$.

It remains to prove that $C_u$ cannot belong to O. Assume the contrary, and let $x_1$ and $x_2$ be the
roots of the components $C_{x_1}$ and $C_{x_2}$ preceding and succeeding $C_u$ in O. If $\phi_1(x_1) = \phi_1'(x_1)$ is an
ancestor of $\phi_1(u)$ and $x_2$ is a descendant of u, then (iii) and (iv) imply that $\phi_1(x_1)$ is an ancestor
of $\phi_1'(u)$. If $x_1$ is an ancestor of u and $\phi_1(x_2) = \phi_1'(x_2)$ is a descendant of $\phi_1(u)$, then (iv) implies
that $\phi_1(x_2)$ is also a descendant of $\phi_1'(u)$. In both cases, O would also be a cycle of $G_{F_2 \div E'}$ because
$C_z \notin O$, a contradiction. Thus, $C_u$ cannot belong to O.

Since any cycle in $G_{F_2 \div E}$ has to contain at least one of $C_u$ and $C_z$, and we have shown that
there is no cycle in $G_{F_2 \div E}$ containing either $C_u$ or $C_z$, $F_2 \div E$ must be acyclic, which is what we
wanted to prove. □



4 MAF Algorithms 

In this section, we present our approximation and FPT algorithms for computing MAFs of rooted 
and unrooted trees. First we present our rooted MAF FPT algorithm in detail. Then we show 
how to modify the algorithm to give a 3-approximation in linear time. Finally we show how these 
algorithms can be modified to compute unrooted agreement forests. 

4.1 An FPT Algorithm for Rooted MAF 

As is customary for FPT algorithms, we focus on the decision version of the problem: "Given two
X-trees $T_1$ and $T_2$, a distance measure $d(\cdot, \cdot)$, and a parameter k, is $d(T_1, T_2) \le k$?" To compute
the distance between two trees, we start with k = 0 and increase it until we receive an affirmative
answer. This does not increase the running time of the algorithm by more than a constant factor,
as the running time depends exponentially on k.
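
The constant-factor claim can be illustrated with a short sketch. Here `decide` is a hypothetical stand-in for the decision procedure, and the cost model (a run with parameter i costs `base**i`) is an assumption made only for this illustration.

```python
def smallest_yes(decide, k_max):
    """Return the smallest k in 0..k_max for which decide(k) is true, else None."""
    for k in range(k_max + 1):
        if decide(k):
            return k
    return None


def total_cost(base, k):
    """Total cost of trying parameters 0..k if the run with parameter i costs base**i."""
    return sum(base ** i for i in range(k + 1))
```

Because the costs form a geometric series, `total_cost(base, k)` is at most `base / (base - 1)` times the cost `base**k` of the final run alone, for any `base > 1`.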
Our FPT algorithm is recursive. Each invocation takes two forests $F_1$ and $F_2$ of $T_1$ and $T_2$ and
a parameter k as inputs, and decides whether $e(T_1, T_2, F_2) \le k$. We denote such an invocation by
Maf$(F_1, F_2, k)$. The forest $F_1$ is the union of a tree $\dot{T}_1$ and a forest F disjoint from $\dot{T}_1$, while $F_2$ is
the union of the same forest F and another forest $\dot{F}_2$ with the same label set as $\dot{T}_1$. We maintain
two sets of labelled nodes: $R_d$ contains the roots of F, and $R_t$ contains roots of (not necessarily
maximal) subtrees that agree between $\dot{T}_1$ and $\dot{F}_2$. We refer to the nodes in these sets by their
labels. For the top-level invocation, $F_1 = \dot{T}_1 = T_1$, $F_2 = \dot{F}_2 = T_2$, and $F = \emptyset$; $R_d$ is empty, and $R_t$
contains all leaves of $\dot{T}_1$.

Maf$(F_1, F_2, k)$ uses the results from Section 3.1 to identify a small collection $\{E_1, E_2, \dots, E_q\}$
of subsets of edges of $F_2$ such that $e(T_1, T_2, F_2) \le k$ if and only if $e(T_1, T_2, F_2 - E_i) \le k - |E_i|$, for
at least one $1 \le i \le q$. It makes a recursive call Maf$(F_1, F_2 - E_i, k - |E_i|)$, for each subset $E_i$,
and returns "yes" if and only if one of these calls does. The steps of this procedure are as follows.

1. (Failure) If k < 0, there is no subset E of at most k edges of $F_2$ such that $F_2 - E$ yields an AF
of $T_1$ and $T_2$. Return "no" in this case.

2. (Success) If $|R_t| \le 2$, then $\dot{F}_2 \subseteq \dot{T}_1$. Hence, $F_2 = \dot{F}_2 \cup F$ is an AF of $F_1$ and $F_2$ and, thus, also
of $T_1$ and $T_2$. Return "yes" in this case.

3. (Prune maximal agreeing subtrees) If there is a node $r \in R_t$ that is a root in $\dot{F}_2$, remove r from
$R_t$ and add it to $R_d$, thereby moving the corresponding subtree of $\dot{F}_2$ to F; cut the edge $e_r$ in
$\dot{T}_1$ and apply a forced contraction to remove r's parent from $\dot{T}_1$; return to Step 2. This does not
alter $F_2$ and, thus, neither $e(T_1, T_2, F_2)$. If no such root r exists, proceed to Step 4.

4. Choose a sibling pair (a, c) in $\dot{T}_1$ such that $a, c \in R_t$.

5. (Grow agreeing subtrees) If (a, c) is a sibling pair of $\dot{F}_2$, remove a and c from $R_t$; label their
parent in both forests with (a, c) and add it to $R_t$; return to Step 2. If (a, c) is not a sibling pair
of $\dot{F}_2$, proceed to Step 6.

6. (Cut edges) Distinguish three cases (see Figure 6):

6.1. If $a \not\sim_{F_2} c$, make recursive calls Maf$(F_1, F_2 \div \{e_a\}, k - 1)$ and Maf$(F_1, F_2 \div \{e_c\}, k - 1)$.

6.2. If $a \sim_{F_2} c$ and the path from a to c in $F_2$ has one pendant node b, make a recursive call
Maf$(F_1, F_2 \div \{e_b\}, k - 1)$.

6.3. If $a \sim_{F_2} c$ and the path from a to c in $F_2$ has $q \ge 2$ pendant nodes $b_1, b_2, \dots, b_q$, make three
recursive calls Maf$(F_1, F_2 \div \{e_{b_1}, e_{b_2}, \dots, e_{b_q}\}, k - q)$, Maf$(F_1, F_2 \div \{e_a\}, k - 1)$, and
Maf$(F_1, F_2 \div \{e_c\}, k - 1)$.

Return "yes" if one of the recursive calls does; otherwise return "no".
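
The branching rule of Step 6 can be summarised as a small function. This is only an illustrative sketch: the function name `step6_branches`, the string case labels, and the representation of edges as arbitrary hashable values are assumptions made here, not part of the algorithm's description.

```python
def step6_branches(case, k, e_a, e_c, pendant_edges):
    """Return the (edge set, remaining budget) pairs on which Step 6 recurses.

    case is "separate", "one", or "multiple"; pendant_edges holds e_b in the
    "one" case and e_{b_1}, ..., e_{b_q} in the "multiple" case.
    """
    if case == "separate":   # Case 6.1: cut e_a or e_c
        return [({e_a}, k - 1), ({e_c}, k - 1)]
    if case == "one":        # Case 6.2: cut the single pendant edge e_b
        (e_b,) = pendant_edges
        return [({e_b}, k - 1)]
    if case == "multiple":   # Case 6.3: cut all pendant edges, or e_a, or e_c
        q = len(pendant_edges)
        return [(set(pendant_edges), k - q), ({e_a}, k - 1), ({e_c}, k - 1)]
    raise ValueError(f"unknown case: {case!r}")
```

Each returned pair corresponds to one recursive call Maf$(F_1, F_2 \div E_i, k - |E_i|)$.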

As we argue in the proof of Theorem 4 below, the correctness of this algorithm follows from
Lemmas 4, 5, and 6. To obtain the time bound claimed in the theorem, we need the following
lemma.

Lemma 8. Each execution of Steps 1-5 takes constant time. Step 6 takes linear time to distinguish
between Cases 6.1, 6.2, and 6.3 and to create a copy of the input before modifying it and passing it
as input to each recursive call. All other operations in Step 6 take time proportional to the number
of edges cut.



Figure 6: The cases in Step 6 of the rooted MAF algorithm: Step 6.1 (Separate Components),
Step 6.2 (One Pendant Node), and Step 6.3 (Multiple Pendant Nodes). Only $F_2$ is shown. Each box
represents a recursive call.

Proof. First we discuss how to represent the two forests to ensure that each step of the algorithm,
except Step 6, takes constant time. We represent each forest as a collection of nodes, each of which
points to its parent, left child, and right child. In addition, every labelled node (i.e., each node
in $R_t$ or $R_d$) stores a pointer to its counterpart in the other forest. For $\dot{T}_1$, we maintain a list of
sibling pairs of labelled nodes. Every labelled node of $\dot{T}_1$ stores a pointer to the pair it belongs to,
if any. For $\dot{F}_2$, we maintain a list $R_d' \subseteq R_t$ of labelled nodes that are roots of $\dot{F}_2$. This list is used
to move these roots from $R_t$ to $R_d$.

Steps 1 and 2 clearly take constant time. To implement Step 3, we test whether the root list
$R_d'$ is empty. If it is not, we remove its first element r. Given the pointers in our representation of
the input, it takes constant time to identify node r and its parent in $\dot{T}_1$, remove r from $R_t$, move
it to $R_d$, cut $e_r$ in $\dot{T}_1$, and splice out r's parent. If r belonged to a sibling pair in $\dot{T}_1$, r stores a
pointer to this pair, and we can eliminate the pair from the list of sibling pairs in constant time.
The sibling a of r in $\dot{T}_1$ may now be the sibling of another node c in $\dot{T}_1$, and we may need to add
the pair (a, c) to the list of sibling pairs. This is the case if a's new sibling c is labelled, which is
also easily checked in constant time. Thus, we can check in constant time whether Step 3 applies
and, if so, apply it.

Step 4 simply removes the next sibling pair from the list of sibling pairs in constant time.

In Step 5, we can test in constant time whether the sibling pair (a, c) is also a sibling pair of
$\dot{F}_2$ by testing whether a and c have the same parent in $\dot{F}_2$. If so, we remove a and c from $R_t$, add
their parent to $R_t$, and set up pointers between the two parents in $\dot{T}_1$ and $\dot{F}_2$. If $p_a$ has no parent
in $\dot{F}_2$, it is now a root and is added to the root list $R_d'$. Additionally, if $p_a$ has a labelled node $c'$
for a sibling in $\dot{T}_1$, then $(p_a, c')$ is a sibling pair in $\dot{T}_1$, which must be added to the list of sibling
pairs. All these steps necessary to implement Step 5 take constant time.

In Step 6, we distinguish between Cases 6.1, 6.2, and 6.3 by walking up the paths from a and
c to the roots of their components in $F_2$ to determine whether they intersect; if they do, the same
traversal can count the number of pendant edges of the two portions of these two paths between
a and c and their lowest common ancestor. This takes linear time, as does copying the current
representation of $F_1$ and $F_2$ before modifying it by cutting edges and passing it to each recursive
call. Similar to Step 3, which cuts edges in $F_1$, each edge in $F_2$ we cut in Step 6 can be cut
in constant time, with an additional constant cost to add the bottom endpoint of each cut edge
to $R_d'$. □
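
As an illustration of the pointer-based representation, the sketch below shows one way to cut the edge above a node and splice out its parent (the forced contraction of Step 3) in constant time. The class `Node`, its field `twin`, and the function `cut_and_contract` are hypothetical names chosen for this example; the sibling-pair list and the root list $R_d'$ are omitted.

```python
class Node:
    """Forest node with parent, left-child, and right-child pointers."""
    __slots__ = ("label", "parent", "left", "right", "twin")

    def __init__(self, label=None, left=None, right=None):
        self.label = label
        self.parent = None
        self.left = left
        self.right = right
        self.twin = None  # counterpart of a labelled node in the other forest
        for child in (left, right):
            if child is not None:
                child.parent = self


def cut_and_contract(r):
    """Cut the edge above r and splice out r's former parent in O(1) time.

    Returns r's former sibling, or None if r was already a root.
    """
    p = r.parent
    if p is None:
        return None
    sibling = p.right if p.left is r else p.left
    grandparent = p.parent
    r.parent = None
    if sibling is not None:
        sibling.parent = grandparent
    if grandparent is not None:
        # The sibling takes the place of the contracted parent.
        if grandparent.left is p:
            grandparent.left = sibling
        else:
            grandparent.right = sibling
    p.parent = p.left = p.right = None  # p is no longer part of the forest
    return sibling
```

Only a fixed number of pointers are read or written, which is what the constant-time claims for Steps 3 and 6 rely on.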
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Corollary 1. Each invocation Maf (Fi, F2,k), excluding recursive calls it makes, takes linear time. 

Proof. By the preceding lemma, each execution of Steps 1–5 takes constant time, while Step 6 takes linear time.
Steps 1 and 6 are executed only once per invocation. Steps 2–5 form a loop, and each iteration,
except the first one, is the result of finding a root of $F_2$ in Step 3 or merging a sibling pair in Step 5.
In the former case, Step 3 cuts an edge in $F_1$, which can happen only $O(n)$ times because $F_1$ has
$O(n)$ edges. In the latter case, the number of nodes in $R_t$ reduces by one, which cannot happen
more than $n$ times because the algorithm starts with the $n$ leaves of $T_1$ in $R_t$ and the number of
nodes in $R_t$ never increases. Thus, Steps 2–5 are executed $O(n)$ times, and the cost of the entire
invocation is linear. □

Theorem 4. For two rooted X-trees $T_1$ and $T_2$ and a parameter $k$, it takes $O((1 + \sqrt{2})^k n) =
O(2.42^k n)$ time to decide whether $e(T_1, T_2, T_2) \le k$.

Proof. By Corollary 1, each invocation $\textsc{Maf}(F_1, F_2, k')$ takes $O(n)$ time. Thus, the claimed running
time follows if we can bound the number of invocations by $O((1 + \sqrt{2})^k)$. The number of recursive
calls spawned by an invocation depends only on $k$ and satisfies the recurrence

\[
I(k) =
\begin{cases}
1 & k < 0 \\
1 + 2I(k - 1) & \text{Case 6.1} \\
1 + \cdots & \text{Case 6.2} \\
1 + 2I(k - 1) + I(k - q) & \text{Case 6.3}
\end{cases}
\;\le\; 1 + 2I(k - 1) + I(k - 2)
\]

because Case 6.3 dominates the other two cases and $q \ge 2$ in this case. Simple substitution shows
that this recurrence solves to $I(k) = O((1 + \sqrt{2})^k)$.
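As a numerical sanity check of this bound, the following minimal sketch iterates the dominating recurrence $I(k) = 1 + 2I(k-1) + I(k-2)$, with $I(k) = 1$ for $k < 0$, and compares the ratio of successive values with $1 + \sqrt{2}$. The function name `invocations_bound` is an illustrative choice of ours and not part of the algorithm.

```python
import math


def invocations_bound(k: int) -> int:
    """Value of I(k) for I(k) = 1 + 2*I(k-1) + I(k-2), with I(k) = 1 for k < 0."""
    prev2, prev1 = 1, 1  # I(-2) and I(-1)
    for _ in range(k + 1):  # each iteration advances the index by one
        prev2, prev1 = prev1, 1 + 2 * prev1 + prev2
    return prev1


growth = 1 + math.sqrt(2)  # claimed base of the exponential bound
ratios = [invocations_bound(k + 1) / invocations_bound(k) for k in range(20, 25)]
```

For moderately large $k$, the entries of `ratios` should be very close to `growth`, which is consistent with $I(k) = O((1 + \sqrt{2})^k)$.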

It remains to prove the correctness of the algorithm, which we do by induction on $k$. An
invocation $\textsc{Maf}(F_1, F_2, k)$ with $k < 0$ correctly returns "no" in Step 1. So assume $k \ge 0$. We prove
that the invocation $\textsc{Maf}(F_1, F_2, k)$ returns "yes" if and only if $e(T_1, T_2, F_2) \le k$.

First assume $\textsc{Maf}(F_1, F_2, k)$ returns "yes". If this happens in Step 2, this is correct because
$k \ge 0$ and $F_2$ is an AF of $T_1$ and $T_2$, that is, $e(T_1, T_2, F_2) = 0 \le k$. If the invocation does not
produce an answer in Step 2, it produces an answer in Step 6. Since $\textsc{Maf}(F_1, F_2, k)$ returns "yes",
there exists a recursive call $\textsc{Maf}(F_1, F_2 \div E_i, k - |E_i|)$ made in this step that returns "yes". By the
induction hypothesis, this implies that $e(T_1, T_2, F_2 \div E_i) \le k - |E_i|$ and, hence, $e(T_1, T_2, F_2) \le k$.
Thus, $\textsc{Maf}(F_1, F_2, k)$ correctly returns "yes" in this case as well.

Now assume $e(T_1, T_2, F_2) \le k$. If $\textsc{Maf}(F_1, F_2, k)$ produces an answer in Step 2, the answer is
"yes", which is correct. If the algorithm produces an answer in Step 6, Lemmas 4, 5, and 6 show
that, whichever case of Step 6 applies, there exists a recursive call $\textsc{Maf}(F_1, F_2 \div E_i, k - |E_i|)$ such
that $E_i \subseteq E$, for some set $E$ of size $e(T_1, T_2, F_2)$ such that $F_2 \div E$ is an AF of $T_1$ and $T_2$.
Hence, $e(T_1, T_2, F_2 \div E_i) \le |E| - |E_i| = e(T_1, T_2, F_2) - |E_i| \le k - |E_i|$, which, by the induction
hypothesis, implies that $\textsc{Maf}(F_1, F_2 \div E_i, k - |E_i|)$ returns "yes". Therefore, $\textsc{Maf}(F_1, F_2, k)$ also
returns "yes". □

Using reduction rules by Bordewich et al. [8], we can improve the running time of the algorithm
for values of $k$ such that $k > 2 \log_{2.42} n$ and $k = o(n)$. Given two trees $T_1$ and $T_2$, these reduction
rules take $O(n^3)$ time to produce two trees $T'_1$ and $T'_2$ of size at most $c \cdot e(T_1, T_2, T_2)$ each, for some
constant $c > 0$, and such that $e(T'_1, T'_2, T'_2) = e(T_1, T_2, T_2)$. If one of the trees has size greater than
$ck$, then $e(T_1, T_2, T_2) > k$, and we can answer "no" without any further processing. If both trees
have size at most $ck$, we can apply Theorem 4 to $T'_1$ and $T'_2$ to decide in $O(2.42^k k)$ time whether
$e(T'_1, T'_2, T'_2) \le k$. Thus, we obtain the following corollary.

Figure 7: Step 6 of the (a) rooted and (b) unrooted MAF 3-approximation algorithms. Only $F_2$ is
shown. Each such step cuts three edges and reduces $e(T_1, T_2, F_2)$ by at least one.

Corollary 2. For two rooted X-trees $T_1$ and $T_2$ and a parameter $k$, it takes $O(2.42^k k + n^3)$ time
to decide whether $e(T_1, T_2, T_2) \le k$.
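The range of $k$ for which the kernelized bound is the better one can be read off by comparing the two terms of $O(2.42^k k + n^3)$ with the bound $O(2.42^k n)$ of Theorem 4. The following short derivation is a sketch of that comparison, ignoring constant factors.

```latex
\begin{align*}
n^3 \le 2.42^k\, n
  &\iff n^2 \le 2.42^k
   \iff k \ge 2 \log_{2.42} n, \\
2.42^k\, k = o\!\left(2.42^k\, n\right)
  &\iff k = o(n).
\end{align*}
```

When both conditions hold, the additive $n^3$ term is no larger than the bound of Theorem 4, and the exponential term is asymptotically smaller.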



4.2 A 3-Approximation Algorithm for Rooted MAF

We now show how to modify the FPT algorithm for rooted MAF to obtain a linear-time
3-approximation algorithm. The algorithm is easy to implement iteratively, and this may be
preferable in practice. In order to minimize the differences to the FPT algorithm from Section 4.1,
however, we describe it as a recursive algorithm. There are three differences to the FPT algorithm:

• Since we do not want to decide whether $e(T_1, T_2, T_2) \le k$, for some parameter $k$, but instead
want to compute a value $k'$ such that $e(T_1, T_2, T_2) \le k' \le 3e(T_1, T_2, T_2)$, we do not pass $k$
as a parameter to the algorithm, and the algorithm does not return "yes" or "no" but instead the
number of edges it cuts in $F_2$ to obtain an AF of $T_1$ and $T_2$. In particular, there is
no need for Step 1 and, whenever Step 2 would have returned "yes" in the MAF algorithm,
it returns 0 in the approximation algorithm, as $F_2$ is an MAF of $T_1$ and $T_2$ in this case, and
no edges need to be cut.

• Step 6 does not distinguish between Cases 6.1, 6.2, and 6.3. Instead, we replace it with the
following simpler step, which makes only one recursive call.

6. (Cut edges) Make a recursive call $\textsc{Maf}(F_1, F_2 \div \{e_a, e_b, e_c\})$, where $b$ is $a$'s sibling in $F_2$,
and return 3 plus its return value.

Note that this step cuts three edges ($e_a$, $e_b$, and $e_c$) in addition to the edges cut by the
recursive call $\textsc{Maf}(F_1, F_2 \div \{e_a, e_b, e_c\})$ and, thus, returns three more than the return value
of this recursive call as the total count of cut edges.

• For the correctness of Step 6, it is important that we label the members of the sibling pair
$(a, c)$ so that $a$'s distance from the root of $T_2$ is no less than $c$'s distance from the root of $T_2$, as
this ensures that c ^ F 2 . To be able to do this in constant time per invocation Maf (Fi, F 2 ), 
we preprocess $T_2$ by labelling each of its nodes with its distance from the root. This can be
done in linear time using a simple depth-first traversal of $T_2$.
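The depth-labelling preprocessing in the last item can be sketched as follows. This is a minimal illustration under our own assumptions: the tree is given as a dictionary `children` mapping each internal node to a list of its children, and the names `label_depths` and `order_sibling_pair` are hypothetical helpers, not taken from the algorithm's description.

```python
def label_depths(children, root):
    """Label every node with its distance from the root using an iterative DFS."""
    depth = {root: 0}
    stack = [root]
    while stack:
        v = stack.pop()
        for w in children.get(v, ()):  # leaves have no entry in `children`
            depth[w] = depth[v] + 1
            stack.append(w)
    return depth


def order_sibling_pair(a, c, depth):
    """Relabel the pair so that the first member is at least as deep as the second."""
    return (a, c) if depth[a] >= depth[c] else (c, a)


# Hypothetical example tree: root r with children u and c; u has children a and b.
children = {"r": ["u", "c"], "u": ["a", "b"]}
depth = label_depths(children, "r")
pair = order_sibling_pair("c", "a", depth)
```

Each node is pushed and popped once, so the labelling takes linear time, and ordering a pair afterwards needs only two dictionary lookups.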



Using these modifications (illustrated in Figure 7(a)), we obtain the following theorem.

Theorem 5. Given two rooted X-trees $T_1$ and $T_2$, a 3-approximation of $e(T_1, T_2, T_2) = d_{SPR}(T_1, T_2)$
can be computed in linear time.

Proof. We use the algorithm just described and output the final return value $k'$ as the approximation
of $e(T_1, T_2, T_2)$.

Figure 8: Step 6 of the unrooted MAF algorithm. Only $F_2$ is shown. Each box represents a
recursive call.

First we prove that the algorithm's running time is $O(n)$. Observe that Step 6 of the
approximation algorithm does not distinguish between different subcases, and it makes only one recursive
call, making the copying of the current state of the algorithm as in the FPT algorithm unnecessary.
Thus, by the lemma preceding Corollary 1, each execution of any step of the algorithm, including Step 6, takes constant
time, and it suffices to prove that each step is executed $O(n)$ times.

If we record the sequence of steps executed over all recursive calls of the algorithm, we observe
that any two executions of Step 2 have at least one execution of Step 3, 5, or 6 between them. The
same is true for any two executions of Step 4. Thus, it suffices to bound the number of executions
of Steps 3, 5, and 6. As argued in the proof of Corollary 1, each execution of Step 5 reduces the
number of nodes in $R_t$ by one, while each execution of Step 3 or 6 cuts at least one edge in $F_1$
or $F_2$. Since $T_1$ and $T_2$ have $O(n)$ edges and initially there are $n$ nodes in $R_t$, this implies that
each of these steps can be executed only $O(n)$ times, and the linear time bound follows.

To prove the approximation ratio of the algorithm, consider an invocation $\textsc{Maf}(F_1, F_2)$, and
assume it makes $r$ recursive calls, not counting the invocation $\textsc{Maf}(F_1, F_2)$ itself. Then its return
value is $k' = 3r$ because only Step 6 invokes the algorithm recursively and this step increases $k'$ by
three. We prove by induction on $r$ that $e(T_1, T_2, F_2) \le 3r \le 3e(T_1, T_2, F_2)$.

If $r = 0$, $\textsc{Maf}(F_1, F_2)$ returns in Step 2, which implies that $F_2$ is an AF of $T_1$ and $T_2$, that is,
$e(T_1, T_2, F_2) = 0$, and $3r$ is a 3-approximation of $e(T_1, T_2, F_2)$.

If $r > 0$, $\textsc{Maf}(F_1, F_2)$ identifies a sibling pair $(a, c)$ of $F_1$ such that $a$ and $c$ are neither roots nor
siblings in $F_2$ and makes a recursive call $\textsc{Maf}(F_1, F_2 \div \{e_a, e_b, e_c\})$, where $b$ is $a$'s sibling in $F_2$,
and $\textsc{Maf}(F_1, F_2 \div \{e_a, e_b, e_c\})$ makes $r - 1$ recursive calls. By the induction hypothesis, we have

\[
e(T_1, T_2, F_2 \div \{e_a, e_b, e_c\}) \le 3(r - 1) \le 3e(T_1, T_2, F_2 \div \{e_a, e_b, e_c\}). \tag{1}
\]

The left-hand side of (1) immediately implies that $e(T_1, T_2, F_2) \le 3r$, since
$e(T_1, T_2, F_2) \le e(T_1, T_2, F_2 \div \{e_a, e_b, e_c\}) + 3$. By Theorem 3, we have
$e(T_1, T_2, F_2 \div \{e_a, e_b, e_c\}) \le e(T_1, T_2, F_2) - 1$. Together with the right-hand side of (1), this gives
$3r \le 3e(T_1, T_2, F_2)$, finishing the proof. □

4.3 Unrooted MAF Algorithms 

We now modify the FPT algorithm for computing rooted MAFs to work in the unrooted setting.
In spite of the trees (and forests) being unrooted, it is natural to work with rooted subtrees of $F_1$
and $F_2$, as in the rooted case. In particular, we can view $T_1$ as consisting of rooted subtrees whose
roots are in $R_t$ and are attached to an unexplored and unrooted "core" of $T_1$. This is no different
from the rooted case, except that the core was rooted in that case. To compute unrooted MAFs,
we make the following modifications to the FPT algorithm from Section 4.1.

• In Step 5, it is not sufficient to check whether $a$ and $c$ are siblings in $F_2$. They may in fact be
neighbours in $F_2$. Thus, we have three cases now: they are neighbours, they are siblings, or
they are neither neighbours nor siblings. We treat the last two cases as in the rooted case. If
$a$ and $c$ are neighbours in $F_2$, we split the edge between $a$ and $c$ in $F_2$ by introducing a new
node $r$ adjacent to both $a$ and $c$. By declaring $a$ and $c$ to be children of $r$, the component
containing $a$ and $c$ becomes a rooted tree, with root $r$, and $a$ and $c$ become siblings in $F_2$.
This allows us to finish handling this case as when $a$ and $c$ are siblings in $F_2$.

• Step 6 applies when $a$ and $c$ are neither neighbours nor siblings in $F_2$ and neither is a root
of $F_2$: Step 6 is reached only after Step 3 has pruned all nodes in $R_t$ that are roots in $F_2$.
Thus, both $a$ and $c$ have siblings $b$ and $d$ in $F_2$, and we implement Step 6 as follows; see
Figure 8.

6. (Cut edges) Make recursive calls $\textsc{Maf}(F_1, F_2 \div \{e_a\}, k - 1)$, $\textsc{Maf}(F_1, F_2 \div \{e_b\}, k - 1)$,
$\textsc{Maf}(F_1, F_2 \div \{e_c\}, k - 1)$, and $\textsc{Maf}(F_1, F_2 \div \{e_d\}, k - 1)$. Return "yes" if one of these
calls does; otherwise return "no".

Theorem 6. For two unrooted X-trees $T_1$ and $T_2$ and a parameter $k$, it takes $O(4^k n)$ time to
decide whether $e(T_1, T_2, T_2) \le k$.

Proof. The running time of the algorithm is established as in Theorem 4, observing that each
invocation makes four recursive calls with the parameter $k$ reduced by one.

The correctness also follows using the arguments used to prove Theorem 4, using Theorem 2
instead of Lemmas 4, 5, and 6 to prove the correctness of Step 6 and observing that, while Step 5
alters $F_2$ in the case when $a$ and $c$ are neighbours, the splitting of an edge in $F_2$ does not alter
$e(T_1, T_2, F_2)$. □

As in the rooted case, we can use known kernelizations [1] to transform the trees $T_1$ and $T_2$ into
two trees $T'_1$ and $T'_2$ of size $O(e(T_1, T_2, T_2))$ and with $e(T'_1, T'_2, T'_2) = e(T_1, T_2, T_2)$. Using the same
argument as in Section 4.1, this gives the following corollary.

Corollary 3. For two unrooted X-trees $T_1$ and $T_2$ and a parameter $k$, it takes $O(4^k k + n^3)$ time
to decide whether $e(T_1, T_2, T_2) \le k$.

We can use the linear-time 3-approximation algorithm for rooted MAF from Section 4.2 to also
compute a 3-approximation of an unrooted MAF, after making the following modifications.

• In Step 5, we have to account for the possibility of $a$ and $c$ being neighbours, which we do in
the same way as in the FPT algorithm for unrooted MAF just described.

• In Step 6, we may cut the third edge incident to $a$ and $b$'s common neighbour instead of $e_b$;
see Figure 7(b).

6. (Cut edges) Make a recursive call $\textsc{Maf}(F_1, F_2 \div \{e_a, e_{b^*}, e_c\})$, where $b^*$ is a node at distance
two from a and not in F% , and e&* is the edge connecting b* to the neighbour it shares 
with a. Return 3 plus the return value of the recursive call. 

As in the rooted case, the algorithm is easy to implement iteratively, but we presented it as a
recursive algorithm here to make it use the same framework as the FPT algorithm.

Theorem 7. Given unrooted X-trees $T_1$ and $T_2$, a 3-approximation of $e(T_1, T_2, T_2) = d_{TBR}(T_1, T_2)$
can be computed in linear time.

Proof. The linear running time of the algorithm is established as in the rooted case. The
approximation ratio also follows as in the rooted case if we can prove that
$e(T_1, T_2, F_2 \div \{e_a, e_{b^*}, e_c\}) \le e(T_1, T_2, F_2) - 1$. If $b^* = b$, this follows from Theorem 3. It may happen, however, that $b^*$ is the
third neighbour of the common neighbour of $a$ and $b$ in $F_2$.² In this case, we observe that the
edges $e = e_{b^*}$ and $f = e_b$ satisfy Lemma 1 with respect to the edge set $E = \{e_a, e_b, e_c\}$. Hence,
$F_2 \div \{e_a, e_b, e_c\}$ and $F_2 \div \{e_a, e_{b^*}, e_c\}$ yield the same forest, that is,
$e(T_1, T_2, F_2 \div \{e_a, e_{b^*}, e_c\}) = e(T_1, T_2, F_2 \div \{e_a, e_b, e_c\}) \le e(T_1, T_2, F_2) - 1$. □

5 Refining an Agreement Forest to an MAAF 

In this section, we develop an efficient FPT algorithm for computing an MAAF of two rooted 
phylogenies and, thus, for computing their hybridization number. The first part of this algorithm 
is (a slight modification of) the FPT algorithm for computing an MAF; as noted in Section 3.3,
we can use this algorithm to make progress towards an MAAF as long as the current forest is not 
an agreement forest of the trees T\ and T2, that is, we can obtain an agreement forest that can be 
refined to an MAAF by cutting additional edges. The second part of the algorithm is an efficient 
procedure for computing such a refinement of an agreement forest. 

To develop this refinement procedure, we need several new ideas. Each of the following sections 
discusses one of them. The tools introduced in Sections 5.1–5.3 suffice to obtain a fairly simple
refinement procedure that leads to an MAAF algorithm with running time $O(9.68^k n)$, which
already is a substantial improvement over the currently best MAAF algorithm. Sections 5.4 and 5.5
then introduce two refinements that improve the algorithm's running time first to $O(4.84^k n)$ and
then to $O(3.18^k n)$.

Specifically, our first tool, presented in Section 5.1, is an expanded cycle graph $G^*_F$. The central
step in obtaining an AAF from an AF is breaking cycles in the cycle graph $G_F$ by eliminating at
least one component from each cycle. A component is removed from a cycle by cutting edges in
the component. In $G^*_F$, every node of $G_F$ is replaced by the component of $F$ it represents, which
allows us to identify exactly which edges in a component need to be cut in order to eliminate it
from a given cycle. Moreover, if $F$ has $k' + 1$ components, $G^*_F$ contains only $2k'$ of the edges of $G_F$,
which ensures that $G^*_F$ has size $O(n)$, while $G_F$ may have size $O((k')^2)$.

Our second tool is a careful analysis of the structure of a cycle in $G^*_F$, presented in Section 5.2.
This analysis allows us to identify the components of a cycle that are essential in the sense that
at least one of them has to be eliminated to break the cycle (as opposed to only replacing it with
a shorter cycle). We identify one node in each such component, called an exit node, and show that
each cycle has at least one essential component whose exit node $v$ in this cycle has the property that
cutting each edge on the path from $v$ to the root of the component reduces $e(T_1, T_2, F)$ by the
number of edges cut. We call the process of cutting these edges fixing the exit node.

Our third tool, presented in Section 5.3, is a procedure that marks a subset of the nodes of $F$.
Every exit node of an essential component of a cycle in $G^*_F$ is marked by this procedure, which is
why we call these marked nodes potential exit nodes. We show that we mark at most $2k$ nodes and
that, if the current forest $F$ can be refined to an AAF of $T_1$ and $T_2$ with at most $k + 1$ components,
then fixing an appropriate subset of potential exit nodes gives such an AAF. For each such subset,
we can test in linear time whether fixing it gives an AAF with at most $k + 1$ components. Thus, by
testing every possible subset of these $2k$ nodes, we obtain a refinement algorithm with running time
$O(4^k n)$. Combined with the bounded search approach that forms the first part of our algorithm,
this leads to an MAAF algorithm with running time $O(9.68^k n)$. Indeed, the first part of our
algorithm takes $O(2.42^k n)$ time to produce a candidate set of at most $2.42^k$ agreement forests such
that, if $e(T_1, T_2, T_2) \le k$, at least one of them can be refined to an AAF of $T_1$ and $T_2$ with at most
$k + 1$ components. For each such forest $F$, we invoke the refinement procedure, which determines
in $O(4^k n)$ time whether $F$ can be refined to such an AAF.

² After linear-time preprocessing, we could identify $b$ in constant time in each step, allowing us to enforce that we
cut $e_a$ and $e_b$ in each step. As we show here, this is not necessary.

The bound of $2k$ on the number of marked nodes is obtained quite naturally because these
nodes are essentially the top endpoints of the edges in $T_1$ and $T_2$ cut to obtain the agreement forest
$F$ and we cut at most $k$ edges in each tree. Since we can obtain $F$ from both $T_1$ and $T_2$ by cutting
the edges connecting the roots of the components of $F$ to their parents in these trees, we can form
pairs of edges by associating the two parent edges of each component root of $F$ with each other.
Since the marked nodes are the top endpoints of edges in $T_1$ and $T_2$ we cut to obtain $F$, these pairs
of cut edges naturally also define pairs of potential exit nodes. Our first improvement over the
simple refinement algorithm we outlined so far uses the bounded search algorithm to mark nodes
as it progresses towards $F$ and exploits the information this provides about how $F$ is obtained from
$T_1$ and $T_2$ to mark only one node in each pair of potential exit nodes. The refinement step then
considers only marked potential exit nodes. Since the number of pairs of potential exit nodes and,
hence, the number of marked potential exit nodes is at most $k$, this reduces the running time of the
refinement step to $O(2^k n)$, and the running time of the whole algorithm to $O(4.84^k n)$. Section 5.4
discusses this improvement.

In Section 5.5, finally, we obtain our final algorithm with running time $O(3.18^k n)$. So far, we
allowed both parts of the algorithm (finding an AF that can be refined to a small AAF, and the
refinement step) to cut $k$ edges. However, $k$ is the total number of edges we are allowed to cut.
Thus, if the number $k'$ of edges we cut to obtain an AF is large, there are only $k'' := k - k'$ edges
left to cut in the refinement step, allowing us to restrict our attention to small subsets of marked
potential exit nodes and thereby reducing the cost of the refinement step substantially. If, on the
other hand, $k'$ is small, then only a small number of nodes in the agreement forest are marked
and even trying all possible subsets is not too costly. Our final algorithm in Section 5.5 uses an
improved refinement step that takes advantage of these facts.
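The trade-off between $k'$ and $k''$ can be illustrated by counting candidate subsets. The sketch below rests on two simplifying assumptions of ours, which may differ from the details in Section 5.5: at most $k'$ nodes are marked after cutting $k'$ edges, and every fixed exit node uses at least one of the remaining $k'' = k - k'$ cuts, so only subsets of size at most $k''$ need to be tried. The function name `subsets_to_try` is hypothetical.

```python
from math import comb


def subsets_to_try(marked: int, budget: int) -> int:
    """Number of subsets of `marked` nodes that contain at most `budget` nodes."""
    return sum(comb(marked, j) for j in range(min(marked, budget) + 1))


# For a fixed total budget k, tabulate the count for every split k = k' + k''.
k = 10
counts = {k1: subsets_to_try(k1, k - k1) for k1 in range(k + 1)}
```

Under these assumptions, the count is small both when $k'$ is small (few marked nodes) and when $k'$ is close to $k$ (few cuts left), and is largest for intermediate splits.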



5.1 An Expanded Cycle Graph 

The expanded cycle graph $G^*_F$ of an agreement forest $F$ of two rooted phylogenies $T_1$ and $T_2$ is a
supergraph $G^*_F \supseteq F$ with the same vertex set as $F$; see Figure 9(c). Let $E_1$ and $E_2$ be minimal
subsets of edges of $T_1$ and $T_2$ such that $F = T_1 \div E_1 = T_2 \div E_2$. In addition to the edges of
$F$, $G^*_F$ contains one hybrid edge per edge in $E_1 \cup E_2$. To define these edges, we define mappings
from nodes of $F$ to nodes of $T_1$ and $T_2$ and vice versa. As in the definition of the original cycle
graph $G_F$, we map each node $x$ in $F$ to nodes $\phi_1(x)$ in $T_1$ and $\phi_2(x)$ in $T_2$ such that $\phi_i(x)$ is the
lowest common ancestor of all labelled leaves in $T_i$ that are descendants of $x$ in $F$. For the reverse
direction, we define a function $\phi_i^{-1}(\cdot)$; $\phi_i^{-1}(x)$ is defined if and only if $x$ is labelled or belongs to the
path between two labelled nodes $a$ and $b$ such that $a \sim_F b$. In this case, $\phi_i^{-1}(x)$ is the node in $F$
that is the lowest common ancestor of all labelled leaves that are descendants of $x$ in $T_i$ and such
that the path between $x$ and each such leaf does not contain any edges in $E_i$. These mappings are
well defined in the sense that $\phi_i^{-1}(\phi_i(x)) = x$, for all $x \in F$ and $i \in \{1, 2\}$.

The hybrid edges in $G^*_F$ are now defined as follows. There are two such edges per root node $y$
of $F$, except $\rho$, one induced by $T_1$ and one induced by $T_2$. Let $z_i$ be the lowest ancestor of $\phi_i(y)$ in
$T_i$ such that $\phi_i^{-1}(z_i)$ is defined. Then $(\phi_1^{-1}(z_1), y)$ is a $T_1$-hybrid edge and $(\phi_2^{-1}(z_2), y)$ is a $T_2$-hybrid
edge. See Figure 9(c) for an illustration of these edges. Note that neither $\phi_1^{-1}(z_1)$ nor $\phi_2^{-1}(z_2)$ is
a root of $F$. Our first lemma shows that the forest $F$ is an AAF of $T_1$ and $T_2$ if and only if $G^*_F$
contains no cycles, that is, that we can use $G^*_F$ in place of $G_F$ to test whether $F$ is acyclic.

Figure 9: (a) Two trees $T_1$ and $T_2$. (b) An agreement forest $F$ of $T_1$ and $T_2$ obtained by cutting the
dotted edges in $T_1$ and $T_2$, and its cycle graph $G_F$. The component of $F$ represented by each node
of $G_F$ is drawn inside the node. (c) The expanded cycle graph $G^*_F$. Dotted edges are $T_1$-hybrid
edges, dashed ones are $T_2$-hybrid edges.

Lemma 9. $G^*_F$ contains a cycle if and only if $G_F$ does.

Proof. First observe that $G^*_F$ can be obtained from $G_F$ by choosing a subset of the edges of $G_F$
and then replacing each vertex of $G_F$ with a component of $F$. Since the components of $F$ do not
contain cycles, this shows that $G^*_F$ is acyclic if $G_F$ is.

Conversely, for two nodes $u$ and $v$ of $F$, $G^*_F$ contains a path from $u$ to $v$ if $\phi_1(u)$ is an ancestor
of $\phi_1(v)$ or $\phi_2(u)$ is an ancestor of $\phi_2(v)$. This implies that every edge in $G_F$ can be replaced by a
directed path in $G^*_F$, so that $G^*_F$ contains a cycle if $G_F$ does. □
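Given Lemma 9, testing whether $F$ is acyclic reduces to a standard cycle test on a directed graph. The sketch below uses Kahn's topological-sort algorithm, which is our choice for illustration and not prescribed by the text; the representation (a list of nodes plus a list of directed edges, with forest edges directed from parent to child and hybrid edges from tail to head) and the name `has_cycle` are likewise assumptions.

```python
from collections import defaultdict, deque


def has_cycle(nodes, edges):
    """Return True if the directed graph (nodes, edges) contains a cycle."""
    succ = defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # Repeatedly remove nodes without incoming edges (Kahn's algorithm).
    queue = deque(v for v in nodes if indeg[v] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    # Nodes that were never removed lie on, or are reachable from, a cycle.
    return removed < len(nodes)
```

The test runs in time linear in the number of nodes and edges, so on a graph of size $O(n)$ such as $G^*_F$ it takes $O(n)$ time.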

In the remainder of this subsection, we show that $G^*_F$ can be constructed in linear time from
$T_1$, $T_2$, and $F$, a fact we use in our algorithms in Sections 5.3 and 5.4.

Lemma 10. The expanded cycle graph $G^*_F$ of an agreement forest $F$ of two rooted phylogenies $T_1$
and $T_2$ can be computed in linear time.

Proof. Our construction of $G^*_F$ starts with $F$ and then adds the hybrid edges. To add the hybrid
edges induced by $T_1$, we perform a postorder traversal of $T_1$ that computes the mappings $\phi_1(\cdot)$ and
$\phi_1^{-1}(\cdot)$, and the hybrid edges induced by $T_1$. A similar postorder traversal of $T_2$ then computes
$\phi_2(\cdot)$, $\phi_2^{-1}(\cdot)$, and the hybrid edges induced by $T_2$.

We can assume each labelled node (leaf) of $T_1$ or $T_2$ stores a pointer to its counterpart in $F$
and vice versa. Thus, for each leaf $x$, $\phi_i(x)$ and $\phi_i^{-1}(x)$ are given, for $i \in \{1, 2\}$. In addition, we
associate a list $L_x$ with each leaf $x$, where $L_x := \{x\}$ if $x$ is a root of $F$, and $L_x := \emptyset$ otherwise. In
general, after processing a node $x$, $L_x$ stores the set of roots of $F$ that map to descendants of $x$
and have proper ancestors of $x$ as the tails of their $T_1$-hybrid edges. (It is not hard to see that this
is the same ancestor of $x$, for every root in $L_x$.)

After setting up this information for the leaves of $T_1$, the postorder traversal computes the same
information for the non-leaf nodes of $T_1$ and uses it to compute the $T_1$-hybrid edges in $G^*_F$. For a
non-leaf node $x$ with children $l$ and $r$, the mappings $\phi_1^{-1}(l)$ and $\phi_1^{-1}(r)$ and the root lists $L_l$ and
$L_r$ of $l$ and $r$ are computed before processing $x$. Hence, we can use them to compute the mapping
$\phi_1^{-1}(x)$ and the root list $L_x$. We distinguish four cases.

If neither $\phi_1^{-1}(l)$ nor $\phi_1^{-1}(r)$ is undefined or a root of $F$, then they must have a common parent
$p$ in $F$ (because $l$ and $r$ are siblings in $T_1$ and $F$ is a forest of $T_1$). In this case, we set $\phi_1^{-1}(x) = p$
and $\phi_1(p) = x$. If $p$ is a root other than $\rho$, we set $L_x = \{p\}$; otherwise $L_x = \emptyset$.

If both $\phi_1^{-1}(l)$ and $\phi_1^{-1}(r)$ are undefined or a root of $F$, then $\phi_1^{-1}(x)$ is undefined (as $x$ can
belong to a path between two labelled nodes $a$ and $b$ such that $a \sim_F b$ only if this is true for at
least one of its children) and we set $L_x = L_l \cup L_r$.

If only $\phi_1^{-1}(l)$ is undefined or a root of $F$, we set $\phi_1^{-1}(x) := \phi_1^{-1}(r)$ and add a $T_1$-hybrid edge
$(\phi_1^{-1}(x), y)$ to $G^*_F$, for each root $y$ in $L_l$. Then we set $L_x = \emptyset$ ($x$ cannot be the image $\phi_1(x')$ of a
root $x'$ of $F$ and $L_r = \emptyset$ in this case).

The final case where only $\phi_1^{-1}(r)$ is undefined or a root of $F$ is symmetric to the previous case.

It is easy to see that this procedure correctly constructs $G^*_F$ because it directly follows the
definition of $G^*_F$. The running time of the algorithm is also easily seen to be linear. Indeed,
computing the mappings $\phi_1^{-1}(x)$ and possibly $\phi_1(p)$ from $\phi_1^{-1}(l)$ and $\phi_1^{-1}(r)$ takes constant time
per visited node $x$, linear time in total. In the case when $L_x$ is computed as the union of $L_l$ and $L_r$,
$L_l$ and $L_r$ can be concatenated in constant time. In the case when we add a hybrid edge to $G^*_F$, for
each root in $L_l$ or $L_r$, this takes constant time per edge, and we then pass an empty list $L_x$ to $x$'s
parent. The latter implies that every root added to a list $L_x$ leads to the addition of exactly one
hybrid edge to $G^*_F$. Since every node adds at most one root to $L_x$ that is not already present in
$L_l$ or $L_r$, this shows that the addition of hybrid edges to $G^*_F$ also costs linear time in total for all
nodes of $T_1$. The running time of the traversal of $T_2$ is bounded by $O(n)$ using the same arguments.
Hence, the entire algorithm takes linear time. □

One thing to note about the algorithm for constructing $G^*_F$ is that it does not require knowledge
of the edge sets $E_1$ and $E_2$, even though we used these sets to define $G^*_F$. This implies in particular
that, even though there may be different edge sets $E_1$ and $E_2$ such that $T_1 \div E_1 = T_2 \div E_2 = F$,
all of them lead to the same cycle graph: $G^*_F$ is completely determined by $F$ alone.

5.2 Essential Components and Exit Nodes 

In this subsection, we define the essential components of a cycle in $G_F$ and their exit nodes. Our
goal is to prove that $F$ can be refined to an AAF of $T_1$ and $T_2$ with at most $k + 1$ components
exclusively by cutting the edges on the paths from exit nodes to the roots of their components in $F$.

Figure 10: (a) Two trees $T_1$ and $T_2$. (b) An agreement forest $F$ of $T_1$ and $T_2$. (c) $G^*_F$ (with $\rho$'s
component removed for clarity) contains a cycle of length 4. White nodes indicate exit nodes. (d)
Fixing the exit node of component $C_4$ (cutting the bold edges from the exit node to the root of its
component) removes the cycle because none of the resulting subcomponents of $C_4$ is an ancestor
of $C_1$ in $T_2$.

Let $H_1$ be the set of $T_1$-hybrid edges in $G^*_F$, and $H_2$ the set of $T_2$-hybrid edges in $G^*_F$, and
assume $G^*_F$ contains a cycle $O$. Let $h_0, h_1, \ldots, h_{m-1}$ be the hybrid edges in $O$, and consider the
components $C_0, C_1, \ldots, C_{m-1}$ of $F$ connected by these edges. More precisely, using index arithmetic
modulo $m$, we assume the tail and head of edge $h_i$ belong to components $C_i$ and $C_{i+1}$, respectively.
Note that the cycle $O$ enters $C_i$ at its root and leaves it at the tail of the edge $h_i$. We say a
component $C_i$ is essential for $O$ if $h_{i-1} \in H_1$ and $h_i \in H_2$ or vice versa. We say a component $C$
of $F$ is essential if it is essential for at least one cycle in $G^*_F$. A node $x$ of a component $C$ of $F$ is
an exit node of $C$ if $C$ is an essential component $C_i$, for some cycle $O$ in $G^*_F$, and $x$ is the tail of
$h_i$ in this cycle. Figure 10(c) illustrates these concepts. Our first result in this section shows that
there exists an essential component and an exit node of this component so that cutting its parent
edge in $F$ reduces $e(T_1, T_2, F)$ by one, that is, by cutting this edge, we make progress towards an
MAAF of $T_1$ and $T_2$.
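The definition of an essential component depends only on the types of the two hybrid edges through which the cycle enters and leaves the component. The following sketch makes this concrete under an assumed encoding of ours: a cycle is given as a list `edge_types` in which entry $i$ is 1 if $h_i \in H_1$ and 2 if $h_i \in H_2$; the function name `essential_components` is hypothetical.

```python
def essential_components(edge_types):
    """Indices i such that C_i is essential for the cycle.

    edge_types[i] is 1 if h_i is a T_1-hybrid edge and 2 if it is a
    T_2-hybrid edge. C_i is entered via h_{i-1} and left via h_i (indices
    modulo m), and it is essential exactly when these two types differ.
    """
    m = len(edge_types)
    # edge_types[i - 1] wraps around to the last entry for i = 0,
    # which implements the index arithmetic modulo m.
    return [i for i in range(m) if edge_types[i - 1] != edge_types[i]]


# A cycle with hybrid edges of types T_1, T_1, T_2, T_2.
essential = essential_components([1, 1, 2, 2])
```

Because the edge types alternate an even number of times around a cycle, the returned list always has even length.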

Lemma 11. Let $O$ be a cycle in $G^*_F$, let $C_0, C_1, \ldots, C_{m-1}$ be its essential components, and let
$v_i$ be the exit node of component $C_i$ in $O$, for all $0 \le i \le m - 1$. Then $e(T_1, T_2, F \div \{e_{v_i}\}) =
e(T_1, T_2, F) - 1$, for some $0 \le i < m$.

Proof. Let $E$ be an arbitrary edge set of size $e(T_1, T_2, F)$ such that $F' := F \div E$ is an AAF of $T_1$
and $T_2$. If $E \cap \{e_{v_0}, e_{v_1}, \ldots, e_{v_{m-1}}\} \ne \emptyset$, the lemma holds. If $E \cap \{e_{v_0}, e_{v_1}, \ldots, e_{v_{m-1}}\} = \emptyset$, we show
there exists an edge $f \in E$ such that $F' = F \div (E \setminus \{f\} \cup \{e_{v_i}\})$, for some $0 \le i < m$, which again
proves the lemma. Let $r_i$ be the root of component $C_i$, for all $0 \le i < m$. To avoid excessive use
of modulo notation in indices, we define $T_i$, $\phi_i(\cdot)$, etc. to be the same as $T_{2-(i \bmod 2)}$, $\phi_{2-(i \bmod 2)}(\cdot)$,
etc. in the remainder of this proof.

First suppose there exist leaves $a_i \in C_i^{v_i}$ and $c_i \in C_i \setminus C_i^{v_i}$ such that $a_i \sim_{F'} c_i$, for all $0 \le i < m$,
and let $l_i$ be the LCA of $a_i$ and $c_i$ in $F'$. Further, for every node $x$ and for $i \in \{1, 2\}$, let $\phi_i(x)$
and $\phi'_i(x)$ be the nodes in $T_i$ that $x$ maps to based on its descendants in $F$ and $F'$, respectively. Since
$C_0, C_1, \ldots, C_{m-1}$ are the essential components of $O$, $m$ is even and, w.l.o.g., the hybrid edge with
head $r_i$ is $T_{i-1}$-hybrid and the hybrid edge with tail $v_i$ is $T_i$-hybrid. This implies that the lowest
ancestor $x_i$ of $\phi_i(r_{i+1})$ such that $\phi_i^{-1}(x_i)$ is defined and belongs to $C_i$ satisfies $\phi_i^{-1}(x_i) = v_i$.

Now observe that $\phi'_i(l_i)$ is a descendant of $\phi_i(r_i)$ and an ancestor of $x_i$ in $T_i$. The former is
obvious because $\phi_i(r_i)$ is an ancestor of $\phi'_i(r_i)$ in $T_i$ and $l_i$ is a descendant of $r_i$ in $F'$, that is, $\phi'_i(l_i)$
is a descendant of $\phi_i(r_i)$. The latter follows because $\phi'_i(a_i) = \phi_i(a_i)$ is a descendant of $x_i$, while
$\phi'_i(c_i) = \phi_i(c_i)$ is not. Since $x_i$ is an ancestor of $\phi_i(r_{i+1})$, for all $i$, this implies that $\phi'_i(l_i)$ is an
ancestor of $\phi'_i(l_{i+1})$, for all $i$, which shows that the components of $F'$ containing these nodes form
a cycle in $G_{F'}$, contradicting that $F'$ is acyclic.

Thus, there exists a component $C_i$ such that $a \not\sim_{F'} c$, for all labelled leaves $a \in C_i^{v_i}$ and
$c \in C_i \setminus C_i^{v_i}$. If $a \not\sim_{F'} v_i$, for all $a \in C_i^{v_i}$, we choose an arbitrary leaf $a' \in C_i^{v_i}$ and let $f$ be the edge
in $E$ on the path from $a'$ to $v_i$ closest to $v_i$. Since $a \not\sim_{F'} v_i$, for all $a \in C_i^{v_i}$, this edge $f$ and the
edge $e = e_{v_i}$ satisfy the conditions of Lemma 1 and, hence, $F \div E = F \div (E \setminus \{f\} \cup \{e_{v_i}\})$ is an
AAF of $T_1$ and $T_2$.

If there exists a leaf $a' \in C_i^{v_i}$ such that $a' \sim_{F'} v_i$, then $a' \not\sim_{F'} c$, for all $c \in C_i \setminus C_i^{v_i}$, implies that
$v_i \not\sim_{F'} c$, for all $c \in C_i \setminus C_i^{v_i}$. So we can again choose an arbitrary leaf $c' \in C_i \setminus C_i^{v_i}$ and the edge $f$ in
$E$ closest to $v_i$ on the path from $v_i$ to $c'$; Lemma 1 then shows that $F \div E = F \div (E \setminus \{f\} \cup \{e_{v_i}\})$. □

This lemma could be used directly as a basis for a bounded search approach to cycle breaking,
choosing an arbitrary cycle in $G_F$ and making one recursive call per component $C_i$ in this cycle
whose input is $F \div \{e_{v_i}\}$. This would require $O(x^k n)$ time, where $x$ is the length of the longest
cycle in $G_F$. Since $x$ is bounded by $k$, the running time of the refinement step would therefore
become $O(k^k n)$. Similarly, we can obtain an $x$-approximation algorithm based on this idea. The
following corollary is central to obtaining an improved refinement step with running time $O(4^k n)$.
It shows that there must be some essential component $C_i$ such that every edge on the path from
its exit node to its root should be cut. As mentioned in the introduction to this section, we call
cutting the edges on this path fixing the exit node. Removing a cycle by fixing an exit node is
illustrated in Figure 10(d).



Corollary 4. Let $O$ be a cycle in $G_F^*$, let $C_0, C_1, \ldots, C_{m-1}$ be its essential
components, let $v_i$ be the exit node of component $C_i$ in $O$, let $F_i$ be the forest obtained from
$F$ by fixing $v_i$, and let $\ell_i$ be the length of the path from $v_i$ to the root of $C_i$, for all
$0 \le i \le m - 1$. Then $e(T_1, T_2, F_i) = e(T_1, T_2, F) - \ell_i$, for some $0 \le i < m$.

Proof. The proof is by induction on $e(T_1, T_2, F)$. As a base case, suppose $e(T_1, T_2, F) = 1$.
Then, by Lemma 11, there exists some exit node $v_i$ such that
$e(T_1, T_2, F \div \{e_{v_i}\}) = e(T_1, T_2, F) - 1 = 0$. Cutting $e_{v_i}$ splits $C_i$ into two
components $A_i$ and $B_i$ containing the leaves in $C_i^{v_i}$ and in $C_i \setminus C_i^{v_i}$,
respectively. The path from $v_i$ to $r_i$ in $C_i$ cannot contain any edges apart from $e_{v_i}$, as
otherwise $C_0, C_1, \ldots, C_{i-1}, B_i, C_{i+1}, \ldots, C_{m-1}$ would form a cycle in
$G_{F \div \{e_{v_i}\}}$. Thus, the corollary holds for $e(T_1, T_2, F) = 1$.

Now suppose the corollary holds for all forests $F'$ such that $e(T_1, T_2, F') < e(T_1, T_2, F)$. By
Lemma 11, there exists some $C_i$ such that $e(T_1, T_2, F \div \{e_{v_i}\}) = e(T_1, T_2, F) - 1$.
Again, cutting $e_{v_i}$ splits $C_i$ into two components $A_i$ and $B_i$ containing the leaves in
$C_i^{v_i}$ and $C_i \setminus C_i^{v_i}$, respectively. If $\ell_i = 1$, then the path from $v_i$ to
$r_i$ consists of only $e_{v_i}$, and the corollary holds. Otherwise
$C'_0, C'_1, \ldots, C'_{i-1}, C'_i, C'_{i+1}, \ldots, C'_{m-1} := C_0, C_1, \ldots, C_{i-1}, B_i, C_{i+1}, \ldots, C_{m-1}$
is a cycle $O'$ in $G_{F'}$, where $F' := F \div \{e_{v_i}\}$. Note that, for $j \ne i$, the exit node
$v'_j$ of $C'_j$ in $O'$ is $v_j$; the exit node $v'_i$ of $C'_i$ is $v_i$'s sibling in $C_i$. By the
inductive hypothesis, there exists some $C'_j$, $0 \le j < m$, such that
$e(T_1, T_2, F'_j) = e(T_1, T_2, F') - \ell'_j$, where $F'_j$ is obtained from $F'$ by fixing $v'_j$ and
$\ell'_j$ is the length of the path from $v'_j$ to the root of $C'_j$. In particular,
$\ell'_j = \ell_j$, for $j \ne i$; and $\ell'_i = \ell_i - 1$. Thus, if $j \ne i$, the corollary holds
because there exists an edge set $E'$ of size $e(T_1, T_2, F')$ such that $F' \div E'$ is an AAF of
$T_1$ and $T_2$ and such that the edges on the path from $v'_j$ to the root of $C'_j$ belong to $E'$;
hence, the set $E := E' \cup \{e_{v_i}\}$ is an edge set of size $e(T_1, T_2, F)$ such that
$F \div E$ is an AAF of $T_1$ and $T_2$ and such that the edges on the path from $v_j$ to the root of
$C_j$ belong to $E$, that is, $e(T_1, T_2, F_j) = e(T_1, T_2, F) - \ell_j$. If $j = i$, we observe that
cutting $e_{v_i}$ and then fixing $v'_i$ in $F'$ produces the same forest $F'_i$ as fixing $v_i$ in $F$,
that is, $e(T_1, T_2, F_i) = e(T_1, T_2, F'_i) = e(T_1, T_2, F') - \ell'_i = e(T_1, T_2, F) - \ell_i$.
Thus, the corollary holds in this case as well. □



5.3 Marking Potential Exit Nodes — A Simple Refinement Algorithm 

Using the ideas developed so far, we can obtain a first (and fairly simple) FPT algorithm for
computing an MAAF. As outlined at the beginning of this section, our algorithm consists of two
parts. The first part is the MAF algorithm from Section 4.1 modified to apply Lemma 7 instead of
Lemma 5 in Step 6.2 of the algorithm. This is shown on page 33 (with the addition of node marking
and an additional case in Step 6.2 as a basis of the improved algorithm in Section 5.4, which the
reader can ignore for now). When the algorithm reaches an invocation Maaf($F_1, F_2, k'$) such that
$F_2$ is an agreement forest of $T_1$ and $T_2$ and $k' \ge 0$, it would output "yes" in the case of
the MAF algorithm. Since this AF may contain cycles, the hybridization algorithm does not answer "yes"
immediately and instead invokes a refinement algorithm that takes $O(4^k n)$ time to decide whether
$F_2$ can be refined to an AAF of $T_1$ and $T_2$ with at most $k + 1$ components. As discussed at the
beginning of this section, the resulting algorithm has running time
$O(2.42^k \cdot 4^k n) = O(9.68^k n)$ because the bounded search part of the algorithm takes
$O(2.42^k n)$ time and produces at most $2.42^k$ agreement forests for which it invokes the
refinement step.

The refinement algorithm is rather simple. Given an agreement forest $F$ of $T_1$ and $T_2$, we
mark all those nodes of the components of $F$ that have the potential of being exit nodes. Rather
naturally, we call these nodes potential exit nodes. A node $u$ is a potential exit node if it is the
tail of a hybrid edge in $G_F^*$. This implies in particular that every exit node is a potential exit
node and that there are $2(k' - 1)$ potential exit nodes if $F$ has $k'$ components. If $F$ is a forest
produced by the branching part of our algorithm, it has at most $k + 1$ components and, thus, at most
$2k$ potential exit nodes. Below we show that $F$ can be refined to an AAF of $T_1$ and $T_2$ with at
most $k + 1$ components if and only if there exists a subset of potential exit nodes such that fixing
these nodes produces such an AAF. Thus, we consider every possible subset of potential exit nodes and
test whether fixing this subset produces an AAF of $T_1$ and $T_2$ with at most $k + 1$ components.
Testing this takes linear time: we inspect each chosen potential exit node in turn and traverse the
path to the root of its component in the current forest, cutting each traversed edge in constant time;
since we can cut only $O(n)$ edges, cutting these edges takes linear time, as does computing and
counting the connected components of the resulting forest. By Lemma 10, we can construct the
expanded cycle graph of the resulting forest in linear time, and we can use any standard linear-time
topological sorting algorithm to test whether this expanded cycle graph is acyclic. Thus, the total
running time of the entire refinement algorithm is $O(4^k n)$ because there are at most
$2^{2k} = 4^k$ subsets of potential exit nodes to test.
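The subset enumeration described above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not the implementation analysed in the text: the forest is given as a hypothetical `parent` dictionary (roots map to `None`), the acyclicity test on the expanded cycle graph is abstracted into a caller-supplied predicate `is_acyclic`, and the helper names `fix_nodes`, `count_components`, and `refine` are invented for this example. The component count walks up from every leaf, so it is simpler but slower than the linear-time procedure described above.

```python
from itertools import combinations


def fix_nodes(parent, chosen):
    """Return the edges (child, parent) cut by fixing every node in `chosen`.

    Fixing a node cuts every edge on the path from it to the root of its
    component, so the result is the union of these paths and does not
    depend on the order in which the nodes are fixed.
    """
    cut = set()
    for v in chosen:
        u = v
        # If an edge is already cut, every edge above it has been cut too.
        while parent[u] is not None and (u, parent[u]) not in cut:
            cut.add((u, parent[u]))
            u = parent[u]
    return cut


def count_components(parent, leaves, cut):
    """Count the components that still contain at least one labelled leaf."""
    tops = set()
    for leaf in leaves:
        u = leaf
        while parent[u] is not None and (u, parent[u]) not in cut:
            u = parent[u]
        tops.add(u)  # topmost node reachable from this leaf
    return len(tops)


def refine(parent, leaves, potential_exit_nodes, max_components, is_acyclic):
    """Try every subset of potential exit nodes; return one that works, or None.

    `is_acyclic(cut)` is a placeholder for the expanded-cycle-graph test.
    """
    nodes = list(potential_exit_nodes)
    for size in range(len(nodes) + 1):
        for subset in combinations(nodes, size):
            cut = fix_nodes(parent, subset)
            if (count_components(parent, leaves, cut) <= max_components
                    and is_acyclic(cut)):
                return set(subset)
    return None
```

With at most $2k$ potential exit nodes, the two nested loops visit at most $2^{2k} = 4^k$ subsets, which is where the exponential factor of the refinement step comes from.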

Our main lemma in this section is Lemma 12 below, which shows that the set of potential exit
nodes of the forest obtained by fixing a potential exit node in $F$ is a subset of $F$'s potential exit
nodes. This immediately proves the correctness of our refinement algorithm. Indeed, by Corollary 4,
if $F$ can be refined to an AAF of $T_1$ and $T_2$ with at most $k + 1$ components, this can be done by
repeatedly choosing the right exit node in the current forest and fixing it. Since every exit node is
also a potential exit node, Lemma 12 shows that the exit nodes fixed in this process are among the
potential exit nodes of $F$ and, hence, will be considered as a subset of potential exit nodes to fix.
Now it suffices to observe that fixing a subset of exit nodes one node at a time produces the same
forest as simultaneously cutting all edges in the union of the paths from these exit nodes to the roots
of their components in $F$, which means that the order in which we fix a chosen subset of potential
exit nodes is irrelevant.

Lemma 12. Let $F$ be an agreement forest of two trees $T_1$ and $T_2$, let $V$ be the set of potential
exit nodes of $F$, and let $v$ be an arbitrary node in $V$. Let $F'$ be the forest obtained from $F$ by
fixing $v$, and let $V'$ be the set of its potential exit nodes. Then $V' \subset V$.

Proof. Since fixing $v$ removes $v$'s parent edge, $v$ is a root of $F'$, which implies that
$v \notin V'$ because potential exit nodes are not component roots. Thus, $V' \ne V$, and it suffices to
prove that $V' \subseteq V$, that is, that every potential exit node $u$ of $F'$ is also a potential
exit node of $F$. So let $u \in V'$, and let $(u, w)$ be a hybrid edge in $G_{F'}^*$ with tail $u$.
Assume w.l.o.g. that $(u, w)$ is a $T_1$-hybrid edge, let $\phi_1(\cdot)$ and $\phi_1^{-1}(\cdot)$ be
defined as before with respect to $F$, and let $\phi'_1(\cdot)$ and $\phi_1'^{-1}(\cdot)$ be the same
mappings defined with respect to $F'$. By the definition of a hybrid edge, $w$ is the root of a
component of $F'$ and $u = \phi_1'^{-1}(x)$, for the lowest proper ancestor $x$ of $\phi'_1(w)$ such that
$\phi_1'^{-1}(x)$ is defined.

Now let $E_1 \subseteq E'_1$ be edge sets such that $F = T_1 \div E_1$ and $F' = T_1 \div E'_1$, and
let $E$ be the set of edges cut in $F$ by fixing $v$, that is, $F' = F \div E$. We prove that
$a \sim_{T_1 - E_1} x$ if and only if $a \sim_{T_1 - E'_1} x$, for every labelled leaf $a \in T_1^x$.
This implies in particular that $\phi_1^{-1}(y) = \phi_1'^{-1}(y)$, for all nodes $y \in T_1^x$ such
that $x \sim_{T_1 - E_1} y$.

Clearly, if $a \sim_{T_1 - E'_1} x$, then $a \sim_{T_1 - E_1} x$ because $E_1 \subseteq E'_1$. So
assume $a \sim_{T_1 - E_1} x$ but $a \nsim_{T_1 - E'_1} x$, for some labelled leaf $a \in T_1^x$. Since
$\phi_1'^{-1}(x)$ is defined, there exist labelled nodes $b$ and $c$ such that $b \sim_{F'} c$ and $x$
is on the path from $b$ to $c$ in $T_1$. This implies that
$b \sim_{T_1 - E'_1} x \sim_{T_1 - E'_1} c$ and, hence, $b \sim_{T_1 - E_1} x \sim_{T_1 - E_1} c$.
Together with $a \sim_{T_1 - E_1} x$, this implies that $a$, $b$, and $c$ belong to the same connected
component of $T_1 - E_1$ and, hence, to the same connected component of $F$, while $a$ belongs to a
different connected component of $F'$ than $b$ and $c$. Now observe that, since $x$ is an ancestor of
$a$ and is on the path from $b$ to $c$, the lowest common ancestor of $b$ and $c$ in $T_1$ is an
ancestor of $a$. Since $F$ is a forest of $T_1$, this implies that the lowest common ancestor $l$ of
$b$ and $c$ in $F$ also is an ancestor of $a$. Since $b \sim_{F'} c$ and $a \nsim_{F'} c$, the path from
$a$ to $l$ must contain at least one edge in $E$. By the choice of $E$, this implies that one of the
child edges of $l$ also belongs to $E$ and, hence, that $b \nsim_{F - E} c$, a contradiction because
$F' = F \div E$ and $b \sim_{F'} c$.

To finish the proof, let $y$ be the first node after $x$ on the path from $x$ to $\phi'_1(w)$ such
that $\phi_1^{-1}(y)$ is defined. Since $\phi_1'^{-1}(\phi'_1(w)) = w$, $\phi_1'^{-1}(\phi'_1(w))$ is
defined and, hence, $\phi_1^{-1}(\phi'_1(w))$ is defined, that is, such a node $y$ exists. If
$x \nsim_{T_1 - E_1} y$, then $\phi_1^{-1}(y)$ is a root of $F$ and
$(\phi_1^{-1}(x), \phi_1^{-1}(y))$ is a hybrid edge in $G_F^*$. Since
$\phi_1^{-1}(x) = \phi_1'^{-1}(x) = u$, this proves that $u$ is also a potential exit node of $F$. If
$x \sim_{T_1 - E_1} y$, then $\phi_1'^{-1}(y) = \phi_1^{-1}(y)$, that is, $\phi_1'^{-1}(y)$ is defined.
By the choice of $x$, this implies that $y = \phi'_1(w)$. Since $\phi_1'^{-1}(\phi'_1(w))$ is defined,
there exists a leaf $a \in T_1^{\phi'_1(w)}$ such that $a \sim_{T_1 - E'_1} \phi'_1(w)$ and, hence,
$a \sim_{F'} w$ and $a \sim_{T_1 - E_1} \phi'_1(w)$. Together with $\phi'_1(w) \sim_{T_1 - E_1} x$, the
latter implies that $a \sim_{T_1 - E_1} x$, while $(u, w)$ being a hybrid edge implies that
$u \nsim_{F'} w$ and, hence, $a \nsim_{F'} u$ and $a \nsim_{T_1 - E'_1} x$. This is a contradiction, that
is, the case $x \sim_{T_1 - E_1} y$ cannot occur. This finishes the proof. □

Using Lemma 12, we can now prove the following result.

Theorem 8. For two rooted trees $T_1$ and $T_2$ and a parameter $k$, it takes $O(9.68^k n)$ time to
decide whether $e(T_1, T_2, T_2) \le k$.

Proof. First we bound the running time of our algorithm. The running time of the bounded search
tree part of the algorithm is $O(2.42^k n)$ and it finds at most $2.42^k$ agreement forests, thus
spawning at most $2.42^k$ refinement operations. Each refinement operation requires $O(4^k n)$ time.
Thus, the running time of the algorithm is $O(2.42^k n + 2.42^k \cdot 4^k n) = O(9.68^k n)$. We now
prove its correctness.

First suppose $e(T_1, T_2, T_2) > k$. The algorithm returns "yes" only if it finds an acyclic agreement
forest with $k + 1$ or fewer components. So the algorithm correctly returns "no" in this case.

Now assume e (T\,T2,T2) < k. By Lemmas [U El and [3 the bounded search algorithm finds an 
agreement forest $F$ such that $F \div E$ is an MAAF of $T_1$ and $T_2$, for some edge set $E$. By
Corollary 4 and Lemma 12, there exists an MAAF of $T_1$ and $T_2$ that can be obtained by choosing a
subset of potential exit nodes of $F$ and fixing each of the chosen nodes in turn. So the algorithm
finds this MAAF, which has at most $k + 1$ components, when trying this subset of potential exit nodes,
and correctly returns "yes". □



5.4 Halving the Number of Potential Exit Nodes 

Since our search for a set of exit nodes we can fix to obtain an AAF of $T_1$ and $T_2$ with at most
$k + 1$ components tries all possible subsets of potential exit nodes, a reduction in the number of
potential exit nodes we need to consider immediately leads to a corresponding improvement in the
running time of the refinement algorithm. In this section, we show how to halve the number of
potential exit nodes we need to consider. Recall that the root of every component of $F$, except
$\rho$, is the head of two hybrid edges, one $T_1$-hybrid, the other $T_2$-hybrid, and the tails of
these edges are potential exit nodes. Our strategy is to mark potential exit nodes while the bounded
search algorithm produces $F$, and use the information this provides to mark only one of these two
tails as a potential exit node. Thus, each refinement step has to consider at most $k$ potential exit
nodes, reducing its running time from $O(4^k n)$ to $O(2^k n)$.

In general, the result of marking only a subset of potential exit nodes is that we may obtain an
agreement forest $F$ of $T_1$ and $T_2$ that can be refined to an AAF of $T_1$ and $T_2$ with at most
$k + 1$ components but cannot be refined to such an AAF by fixing any subset of the potential exit
nodes marked in $F$. Intuitively, the reason why this is not a problem is that we can show that,
whenever we reach such an AF $F$ where a potential exit node $u$ should be fixed but is not marked,
there exists a branch in the search for AFs that starts by cutting the parent edge $e_u$ of $u$ after
cutting a subset of the edges cut to produce $F$. Thus, if it is necessary to fix $u$ in $F$ to obtain
an AAF $F'$ of $T_1$ and $T_2$ with at most $k + 1$ components, there exists an alternate route to
obtain the same AAF $F'$ by first producing a different agreement forest $F''$ and then refining it.
While this is the intuition, it is in fact possible that our algorithm is unable to produce $F'$ also
along this alternate path. What we do prove is that there exists an agreement forest $F_C$ produced by
the bounded search algorithm that can be refined to an AAF $F'$ of $T_1$ and $T_2$ with at most
$k + 1$ components and such that the potential exit nodes that need to be fixed in $F_C$ to produce
$F'$ are marked.

We accomplish the marking of potential exit nodes as follows. The bounded search algorithm
assigns a label "$T_1$" or "$T_2$" to each component root other than $\rho$ of each agreement forest
$F$ it produces. The refinement algorithm applied to $F$ then constructs $G_F^*$, marks the tail of the
$T_i$-hybrid edge with head $r$ if $r$'s label is "$T_i$", for every component root $r$ in $F$, and then
checks whether an AAF of $T_1$ and $T_2$ with at most $k + 1$ components can be obtained from $F$ by
fixing a subset of the marked nodes.

To produce this labelling of component roots, we augment the three different cases of Step 6 to
label the bottom endpoints of the edges they cut in $F_2$. When a labelled node $x$ loses a child $l$
by cutting the child's parent edge $e_l$, $x$ is contracted into its other child $r$; in this case, $r$
inherits $x$'s label. This ensures that at any time exactly the roots in the current forest $F_2$ are
labelled.
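The label-inheritance rule under forced contractions can be illustrated with a small Python sketch. The representation is an assumption made for this example only: `parent` and `children` are hypothetical dictionaries describing a binary forest, `labels` maps labelled roots to "T1" or "T2", and `cut_parent_edge` is an invented helper name, not a procedure from the text.

```python
def cut_parent_edge(parent, children, labels, node, label):
    """Cut the parent edge of `node`, label `node`, and contract its old parent.

    Assumes a binary forest: after losing `node`, the old parent x has
    exactly one child left and is suppressed by a forced contraction.
    """
    x = parent[node]
    children[x].remove(node)
    parent[node] = None
    labels[node] = label  # the bottom endpoint of the cut edge gets the label

    if len(children[x]) == 1:  # forced contraction of x into its other child
        other = children[x][0]
        grandparent = parent[x]
        parent[other] = grandparent
        if grandparent is not None:
            slot = children[grandparent].index(x)
            children[grandparent][slot] = other
        if x in labels:  # x was a labelled root, so its child inherits the label
            labels[other] = labels.pop(x)
        del parent[x]
        del children[x]
```

In this sketch, the only unlabelled root is the one left over from the original root's component, which corresponds to the exception made for $\rho$ above.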

The following is the pseudo-code of the MAAF algorithm augmented with this labelling procedure.
It shows which case of Step 6 assigns which labels to the component roots it produces.
In the description of the algorithm, we use $k$ to denote the parameter passed to the current
invocation (as in the MAF algorithm), and $k_0$ to denote the parameter of the top-level invocation
Maaf($T_1, T_2, k_0$). Thus, $k_0 + 1$ is the number of connected components we allow the final AAF to
have.

1. (Failure) If $k < 0$, there is no subset $E$ of at most $k$ edges of $F_2$ such that $F_2 - E$
yields an AF of $T_1$ and $T_2$. Return "no" in this case.

2. (Refinement) If $|R_t| \le 2$, then $F_2 = \dot{F}_2 \cup F$ is an AF of $T_1$ and $T_2$. Invoke an
algorithm Refine($F_2, k_0$) that decides whether $F_2$ can be refined to an AAF of $T_1$ and $T_2$
with at most $k_0 + 1$ components. Return the answer returned by Refine($F_2, k_0$).

3. (Prune maximal agreeing subtrees) If there is a node $r \in R_t$ that is a root in $\dot{F}_2$,
remove $r$ from $R_t$ and add it to $R_d$, thereby moving the corresponding subtree of $\dot{F}_2$ to
$F$; cut the edge $e_r$ in $\dot{T}_1$ and apply a forced contraction to remove $r$'s parent from
$\dot{T}_1$; return to Step 2. This does not alter $F_2$ and, thus, neither $e(T_1, T_2, F_2)$. If no
such root $r$ exists, proceed to Step 4.

4. Choose a sibling pair $(a, c)$ in $\dot{T}_1$ such that $a, c \in R_t$.

5. (Grow agreeing subtrees) If $(a, c)$ is a sibling pair of $F_2$, remove $a$ and $c$ from $R_t$;
label their parent in both trees with $(a, c)$ and add it to $R_t$; return to Step 2. If $(a, c)$ is
not a sibling pair of $F_2$, proceed to Step 6.

6. (Cut edges) Distinguish three cases:

6.1. If $a \nsim_{F_2} c$, make two recursive calls:

(i) Maaf($F_1, F_2 \div \{e_a\}, k - 1$) with $a$ labelled "$T_2$" in $F_2 \div \{e_a\}$, and

(ii) Maaf($F_1, F_2 \div \{e_c\}, k - 1$) with $c$ labelled "$T_2$" in $F_2 \div \{e_c\}$.

6.2. If $a \sim_{F_2} c$ and the path from $a$ to $c$ in $F_2$ has one pendant node $b$, swap the
names of $a$ and $c$ if necessary to ensure that $b$ is $a$'s sibling. Then make three recursive calls
(see Figure 11):

(i) Maaf($F_1, F_2 \div \{e_b\}, k - 1$) with $b$ labelled "$T_1$" in $F_2 \div \{e_b\}$,

(ii) Maaf($F_1, F_2 \div \{e_c\}, k - 1$) with $c$ labelled "$T_2$" in $F_2 \div \{e_c\}$, and

(iii) Maaf($F_1, F_2 \div \{e_a, e_c\}, k - 2$) with $c$ labelled "$T_1$" and $a$ labelled "$T_2$" in
$F_2 \div \{e_a, e_c\}$.

6.3. If $a \sim_{F_2} c$ and the path from $a$ to $c$ in $F_2$ has $q \ge 2$ pendant nodes
$b_1, b_2, \ldots, b_q$, make three recursive calls:

(i) Maaf($F_1, F_2 \div \{e_{b_1}, e_{b_2}, \ldots, e_{b_q}\}, k - q$) with each node $b_i$,
$1 \le i \le q$, labelled "$T_1$" in $F_2 \div \{e_{b_1}, e_{b_2}, \ldots, e_{b_q}\}$,

(ii) Maaf($F_1, F_2 \div \{e_a\}, k - 1$) with $a$ labelled "$T_2$" in $F_2 \div \{e_a\}$, and

(iii) Maaf($F_1, F_2 \div \{e_c\}, k - 1$) with $c$ labelled "$T_2$" in $F_2 \div \{e_c\}$.

Return "yes" if one of the recursive calls does; otherwise return "no".

Figure 11: Case 6.2 of Step 6 of the rooted MAAF algorithm. (Cases 6.1 and 6.3 are as in the
rooted MAF algorithm and are illustrated in Figure 6.) Only $F_2$ is shown. Each box represents a
recursive call.
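The branching and labelling rules of Step 6 can be summarised as a small Python sketch that, for a given sibling pair, lists the edges each recursive call cuts and the labels it assigns. The function name `step6_branches` and its arguments are hypothetical and exist only for this illustration; each cut edge is identified by its bottom endpoint, and in Case 6.2 the pendant node is assumed to be the sibling of the first argument, as the swap in that case ensures.

```python
def step6_branches(a, c, same_component, pendants):
    """List the recursive calls of Step 6 as (edges_cut, labels) pairs.

    same_component: whether a and c lie in the same component of F_2.
    pendants: pendant nodes of the path from a to c in F_2 (ignored in 6.1).
    The parameter k decreases by len(edges_cut) in each branch.
    """
    if not same_component:  # Case 6.1
        return [([a], {a: "T2"}), ([c], {c: "T2"})]
    if not pendants:
        # a and c would be siblings in F_2; Step 5 handles that situation.
        raise ValueError("Step 6 does not apply to a sibling pair of F_2")
    if len(pendants) == 1:  # Case 6.2
        b = pendants[0]
        return [
            ([b], {b: "T1"}),
            ([c], {c: "T2"}),
            ([a, c], {c: "T1", a: "T2"}),
        ]
    # Case 6.3: two or more pendant nodes
    return [
        (list(pendants), {b: "T1" for b in pendants}),
        ([a], {a: "T2"}),
        ([c], {c: "T2"}),
    ]
```

Every branch assigns exactly one label per cut edge, which is why a forest produced with at most $k_0$ cuts carries at most $k_0$ labels.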

To give some intuition behind the choice of labels in Step 6 and to prove the correctness of the
algorithm, we modify the algorithm slightly without altering the set of forests it computes. When
cutting an edge $e_x$, $x \in \{a, c\}$, in Step 6 of the algorithm, $x$ becomes the root of a
component of $F_2$ that agrees with a subtree of $F_1$. Hence, the first thing Step 3 of the next
recursive call does is to cut the parent edge of $x$ in $F_1$. We modify the algorithm so that, instead
of postponing the cutting of this edge to Step 3 of the recursive call, we cut the parent edges of $x$
in $F_1$ and $F_2$ simultaneously in Step 6.

Now consider a labelled node $x$ of $F_2$, and let $y_1$ and $y_2$ be $x$'s siblings in $F_1$ and
$F_2$, respectively. If the algorithm cuts the edge $e_x$, $x$ becomes a root and, in the absence of
further changes that eliminate $x$, $y_1$ or $y_2$ from the forest, $x$ is the head of a $T_1$-hybrid
edge $(y_1, x)$ and of a $T_2$-hybrid edge $(y_2, x)$, making $y_1$ and $y_2$ potential exit nodes that
may need to be fixed to obtain a certain AAF of $T_1$ and $T_2$. The first step in fixing a potential
exit node is cutting its parent edge, and an alternate method to produce the same AAF starts by cutting
this parent edge instead of $e_x$. Thus, if apart from cutting $e_x$, the current case includes a
branch that cuts the parent edge of $y_1$ or $y_2$, we do not have to worry about fixing this exit node
in the branch that cuts $e_x$: there exists another branch we explore that has the potential of leading
to the same AAF. To illustrate this idea, consider Case 6.1. Here, when we cut $e_a$, $c$ becomes the
tail of the $T_1$-hybrid edge $(c, a)$ because $a$ and $c$ are siblings in $F_1$. Since the other branch
of this case cuts $e_c$, we do not have to worry about fixing $c$. On the other hand, neither of the two
branches considers cutting the parent edge of $a$'s sibling $b$ in $F_2$, which is the tail of the
$T_2$-hybrid edge with head $a$. Thus, we need to ensure that $b$ is marked in the refinement step,
which we do by labelling $a$ with "$T_2$". The same reasoning justifies the labelling of $c$ with
"$T_2$" in the other branch of this case and the labelling of $a$ and $c$ with "$T_2$" in Case 6.3. The
labelling of every node $b_i$ with "$T_1$" in Case 6.3 is equally easy to justify: Cutting an edge
$e_{b_i}$ makes $b_i$ a root in $F_2$. Thus, either $b_i$ is itself a root of the final AF we obtain or
it is contracted into such a root $z$ using forced contractions; this root $z$ inherits $b_i$'s label.
At the time we cut edge $e_{b_i}$, we do not know which descendant of $b_i$ will become this root $z$,
nor whether any branch of our algorithm considers cutting the parent edge of the tail of $z$'s
$T_1$-hybrid edge. On the other hand, the tail of $z$'s $T_2$-hybrid edge is either $a$ or $c$, and we
cut their parent edges in the other two branches of Case 6.3. The labelling in Case 6.2 is more
difficult to justify, but the proof of Theorem 9 below shows that the algorithm is correct.

Among the agreement forests of $T_1$ and $T_2$ produced by the bounded search part of the algorithm,
there may be several that can be refined to an AAF of $T_1$ and $T_2$ with at most $k + 1$ components.
As already said, while we proved that the algorithm from the previous subsection can obtain an
AAF with at most $k + 1$ components from each such agreement forest $F$ by fixing an appropriate
subset of potential exit nodes, not all nodes in this subset of potential exit nodes may be marked
by the algorithm in this subsection, preventing the refinement step from finding any AAF with at
most $k + 1$ components as a refinement of $F$. Here we choose a canonical agreement forest $F_C$
from among the set of agreement forests that can be refined to produce an AAF with at most
$k + 1$ components. The proof of Theorem 9 below shows that the potential exit nodes in $F_C$
that need to be fixed to obtain such an AAF are marked. Since $F_C$ is produced by a sequence of
recursive calls of procedure Maaf($\cdot, \cdot, \cdot$), we can define $F_C$ by defining the path to
take from the top-level invocation Maaf($T_1, T_2, k_0$) to the invocation Maaf($F_1, F_2, k'$) with
$F_2 = F_C$. We use $F_1^i$ and $F_2^i$ to denote the inputs to the $i$th invocation
Maaf($F_1^i, F_2^i, k_i$) along this path. We also compute an arbitrary numbering of the nodes of $T_1$
and denote the number of $x \in T_1$ by $\nu(x)$. This number is used as a tie-breaker in certain
situations to choose the next invocation along the path of invocations that produce $F_C$. The first
invocation is of course Maaf($T_1, T_2, k_0$), that is, $F_1^1 = T_1$ and $F_2^1 = T_2$. So assume we
have constructed the path up to the $i$th invocation with inputs $F_1^i$ and $F_2^i$. The $(i + 1)$st
invocation is made in Step 6 of the $i$th invocation. We say an invocation Maaf($F_1, F_2, k$) is a
leaf invocation if $F_2$ is an AF of $T_1$ and $T_2$. A leaf invocation Maaf($F_1, F_2, k$) is viable
if $F_2$ can be refined to an AAF of $T_1$ and $T_2$ with at most $k_0 + 1$ components. A non-leaf
invocation Maaf($F_1, F_2, k$) is viable if at least one of the recursive calls it makes is viable. If
there is only one viable recursive call made in Step 6 of the $i$th invocation
Maaf($F_1^i, F_2^i, k_i$), then we choose this recursive call as the $(i + 1)$st invocation
Maaf($F_1^{i+1}, F_2^{i+1}, k_{i+1}$). Otherwise we apply the following rules to choose
Maaf($F_1^{i+1}, F_2^{i+1}, k_{i+1}$) from among the viable invocations made in Step 6 of invocation
Maaf($F_1^i, F_2^i, k_i$). We distinguish the three cases of Step 6.

Case 6.1. In this case, Maaf($F_1^i \div \{e_a\}, F_2^i \div \{e_a\}, k_i - 1$) and
Maaf($F_1^i \div \{e_c\}, F_2^i \div \{e_c\}, k_i - 1$) are both viable invocations. For
$x \in \{a, c\}$, let $F^x$ be the agreement forest found by tracing a path from
Maaf($F_1^i \div \{e_x\}, F_2^i \div \{e_x\}, k_i - 1$) to a viable leaf invocation using recursive
application of these rules, and let $E^x$ be an edge set such that $F^x = T_1 \div E^x$. Let $y$ be
the sibling of $x$ in $F_1^i$ (i.e., $y = c$ if $x = a$ and vice versa). Now let $\phi_1(y)$ once
again be the LCA in $T_1$ of all labelled leaves that are descendants of $y$ in $F_2^i$, and let
$\phi^x(y)$ be the LCA in $F^x$ of all labelled leaves $l$ that are descendants of $\phi_1(y)$ in
$T_1$ and such that the path from $l$ to $\phi_1(y)$ in $T_1$ does not contain an edge in $E^x$.
Finally, if $\phi^x(y)$ is the root of a component of $F^x$, let $\lambda_1(y) := \phi_1(y)$;
otherwise let $\lambda_1(y)$ be the LCA in $T_1$ of all labelled leaves that are descendants of the
parent of $\phi^x(y)$ in $F^x$. In other words, if $\phi^x(y)$ is not a root in $F^x$, then
$\lambda_1(y)$ is the node in $T_1$ where $\phi^x(y)$ and its sibling in $F^x$ are joined by an
application of Step 5. Now let $d_1(y) > 0$ be the distance from the root $\rho$ of $T_1$ to
$\lambda_1(y)$ if $\lambda_1(y) \ne \phi_1(y)$, and $d_1(y) = 0$ otherwise. If $d_1(a) > d_1(c)$, or
$d_1(a) = d_1(c)$ and $\nu(a) < \nu(c)$, we choose the invocation
Maaf($F_1^i \div \{e_a\}, F_2^i \div \{e_a\}, k_i - 1$) as the $(i + 1)$st invocation, that is,
$F_C = F^a$. If $d_1(a) < d_1(c)$, or $d_1(a) = d_1(c)$ and $\nu(a) > \nu(c)$, we choose the
invocation Maaf($F_1^i \div \{e_c\}, F_2^i \div \{e_c\}, k_i - 1$) as the $(i + 1)$st invocation, that
is, $F_C = F^c$.

Case 6.2. In this case, if Maaf($F_1^i \div \{e_a, e_c\}, F_2^i \div \{e_a, e_c\}, k_i - 2$) is
viable, we choose it as the $(i + 1)$st invocation. If the invocation
Maaf($F_1^i \div \{e_a, e_c\}, F_2^i \div \{e_a, e_c\}, k_i - 2$) is not viable, then the invocations
Maaf($F_1^i, F_2^i \div \{e_b\}, k_i - 1$) and Maaf($F_1^i \div \{e_c\}, F_2^i \div \{e_c\}, k_i - 1$)
are both viable. In this case, we choose the latter as the $(i + 1)$st invocation.

Figure 12: Two applications of Case 6.1 where we choose the invocation
Maaf($F_1^i \div \{e_a\}, F_2^i \div \{e_a\}, k_i - 1$) on the path to $F_C$. Both figures show the
relevant portion of $T_1$. Dotted edges have been removed to obtain $F_1^i$, making $a$ and $c$
siblings in $F_1^i$. The bold portion of $T_1$ yields $F_2^i$. Node $b$ is $a$'s sibling in $F^c$. In
Figure (a), $d$ is $c$'s sibling in $F^a$, and the highest node $\lambda_1(c)$ on the path from
$\phi_1(c)$ to $\phi_1(d)$ is an ancestor of the highest node $\lambda_1(a)$ on the path from
$\phi_1(a)$ to $\phi_1(b)$. Hence, $d_1(a) > d_1(c)$. In Figure (b), $c$ is assumed to be a root of
$F^a$. Hence $\phi_1(c) = \lambda_1(c)$ and $d_1(a) > d_1(c) = 0$.

Case 6.3. Since there is more than one viable invocation in this case, at least one of the
invocations Maaf($F_1^i \div \{e_a\}, F_2^i \div \{e_a\}, k_i - 1$) and
Maaf($F_1^i \div \{e_c\}, F_2^i \div \{e_c\}, k_i - 1$) is viable. If exactly one of them is viable,
we choose it to be the $(i + 1)$st invocation. If both are viable, we define $\lambda_1(a)$ and
$\lambda_1(c)$ as in Case 6.1. If $\lambda_1(a) \ne \lambda_1(c)$, we choose the $(i + 1)$st
invocation as in Case 6.1. If $\lambda_1(a) = \lambda_1(c)$, we define $\lambda_2(x)$ and $d_2(x)$,
for $x \in \{a, c\}$, analogously to $\lambda_1(x)$ and $d_1(x)$ but using $\phi_2(\cdot)$ and $T_2$ in
place of $\phi_1(\cdot)$ and $T_1$. Now we choose the $(i + 1)$st invocation as in Case 6.1 but using
$d_2(\cdot)$ instead of $d_1(\cdot)$.

Now we are ready to prove the following improvement of Theorem 8.

Theorem 9. For two rooted X-trees $T_1$ and $T_2$ and a parameter $k_0$, it takes
$O((2(1 + \sqrt{2}))^{k_0} n) = O(4.84^{k_0} n)$ time to decide whether $e(T_1, T_2, T_2) \le k_0$.

Proof. As in the proof of Theorem 8, the running time of the algorithm is determined by the
number of agreement forests the bounded search part of the algorithm produces times the cost of the
refinement step invoked on each of these agreement forests. The former remains
$O((1 + \sqrt{2})^{k_0}) = O(2.42^{k_0})$, in spite of the additional recursive call in Case 6.2,
because the recurrence for this case is now $I(k) = 1 + 2I(k - 1) + I(k - 2)$, which is the same as
the worst case of the recurrence for Case 6.3. The cost of the refinement step is reduced to
$O(2^{k_0} n)$ because, for each of the at most $k_0$ edges the bounded search part cuts in $T_2$ to
produce a given agreement forest, it marks one potential exit node in the forest, and the refinement
step tries to fix every possible combination of these at most $k_0$ marked potential exit nodes. Thus,
the total cost of the algorithm is $O(2.42^{k_0} \cdot 2^{k_0} n) = O(4.84^{k_0} n)$. It remains to
prove its correctness.
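As a numerical sanity check of the bound on the search tree size, the recurrence $I(k) = 1 + 2I(k - 1) + I(k - 2)$ can be iterated directly. The following Python sketch is illustrative only; the base case $I(k) = 1$ for $k \le 0$ and the name `search_tree_size` are assumptions made for this example. The ratio of consecutive values approaches the positive root $1 + \sqrt{2}$ of $x^2 = 2x + 1$, and doubling that root (one factor of 2 per marked node in the refinement step) stays below the stated constant 4.84.

```python
import math


def search_tree_size(k):
    """Iterate I(k) = 1 + 2*I(k-1) + I(k-2), assuming I(k) = 1 for k <= 0."""
    older, newer = 1, 1  # I(-1), I(0)
    for _ in range(k):
        older, newer = newer, 1 + 2 * newer + older
    return newer


# Growth rate of the recurrence, estimated from two consecutive values.
ratio = search_tree_size(40) / search_tree_size(39)
branching_number = 1 + math.sqrt(2)  # positive root of x^2 = 2x + 1

print(f"estimated growth rate: {ratio:.6f}")
print(f"1 + sqrt(2):           {branching_number:.6f}")
print(f"2 * (1 + sqrt(2)):     {2 * branching_number:.6f}")
```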

Since the algorithm returns "yes" only if it finds an AAF of $T_1$ and $T_2$ with at most $k_0 + 1$
components, the algorithm correctly returns "no" if $e(T_1, T_2, T_2) > k_0$.

So assume $e(T_1, T_2, T_2) \le k_0$, let $F_C$ be the canonical agreement forest we defined before
the theorem, and let $E$ be an edge set such that $F' := F_C \div E$ is an AAF of $T_1$ and $T_2$ with
at most $k_0 + 1$ components. By Corollary 4, we can assume $E$ is the union of paths from a subset of
potential exit nodes to the roots of their respective components in $F_C$. These potential exit nodes
may or may not be marked. Now let $M$ be the set of nodes $m \in F_C$ such that every edge on the path
from $m$ to the root of its component in $F_C$ is in $E$ and $m$ or its sibling in $F_C$ is marked. We
say an edge is marked if it belongs to the path from a node $m \in M$ to the root of its component,
that is, if it is removed by fixing this node $m$. Next we prove that all edges in $E$ are marked.
Since fixing a node or its sibling in $F_C$ results in the same forest and every node in $M$ is itself
marked or has a marked sibling, this implies that there exists a subset of marked potential exit nodes
such that fixing them produces $F'$, that is, the refinement step applied to $F_C$ finds $F'$.

Assume for the sake of contradiction that there is an unmarked edge in $E$. Since all ancestor
edges of a marked edge in $F_C$ are themselves marked, this implies that there exists a potential exit
node $u \in G_{F_C}^*$ whose parent edge $e_u$ belongs to $E$ but is not marked, which in turn implies
that neither $u$ nor its sibling $u'$ in $F_C$ is marked. The sequence of invocations that produce
$F_C$ from $T_1$ and $T_2$ gives rise to a sequence of edges the algorithm cuts to produce $F_C$. For a
step that cuts more than one edge, we cut these edges one by one. For Step 3 and branch (i) of Step
6.3, this ordering is chosen arbitrarily. For all branches of Step 6 that cut an edge $e_x$ with
$x \in \{a, c\}$, we choose the ordering so that the parent edge of $x$ in $F_1$ is cut immediately
after cutting $e_x$ in $F_2$. Finally, in branch (iii) of Case 6.2, we cut $e_c$ after $e_a$. Changing
notation slightly, we use $F_1^i$ and $F_2^i$ to refer to the forests obtained from $T_1$ and $T_2$
after cutting the first $i$ edges in the remainder of this proof. Since $F_C$ is a refinement of
$F_1^i$ and $F_2^i$, every node $x \in F_C$ maps to the lowest node $y$ in $F_j^i$ such that the
labelled descendant leaves of $x \in F_C$ are descendants of $y$ in $F_j^i$. This is analogous to the
mappings $\phi_1(\cdot)$ and $\phi_2(\cdot)$ from $F_C$ to $T_1$ and $T_2$. To avoid excessive
notation, we refer to the nodes in $F_1^i$ and $F_2^i$ a node $x \in F_C$ maps to simply as $x$ in the
remainder of this proof.

With this notation, the common parent $p_u$ of $u$ and $u'$ in $F_C$ is the lowest common ancestor
of both nodes in any forest $F_j^i$. Since $u$ is a potential exit node of $F_C$, there is at least one
hybrid edge in $G_{F_C}^*$ induced by cutting a pendant edge of the path from $u$ to $p_u$ in some
forest $F_j^i$. There may also be a hybrid edge induced by cutting a pendant edge of the path from
$u'$ to $p_u$ in some forest $F_j^i$. Edges of either type are pendant to the path from $u$ to $u'$ in
$F_j^i$. Let $i$ be the highest index such that the $i$th edge we cut is pendant to the path from $u$
to $u'$ in $F_1^{i-1}$ or $F_2^{i-1}$, and let $e_y$ be this edge. Let $j \in \{1, 2\}$ so that we cut
$e_y$ in $F_j^{i-1}$. Since $u$ and $u'$ are siblings in $F_C$, $e_y$ is the only pendant edge of the
path from $u$ to $u'$ in $F_j^{i-1}$, that is, either $u$ or $u'$ is $y$'s sibling in $F_j^{i-1}$. We
use $x$ to refer to this sibling, and $x'$ to refer to $x$'s sibling in $F_C$ (that is, $x' = u'$ if
$x = u$ and vice versa). We make two observations about $x$, $x'$, and $y$:

1. Since fixing a node in $F_C$ or its sibling produces the same forest, $F'$ can be obtained from
$F_C$ by fixing a subset of nodes that includes $x$ or $x'$. In particular, we can obtain $F'$ from any
forest $F_j^{i'}$ by cutting $e_x$ or $e_{x'}$ and then continuing to cut an appropriate set of edges
in the resulting forest, that is,
$e(T_1, T_2, F_j^{i'} \div \{e_x\}) = e(T_1, T_2, F_j^{i'} \div \{e_{x'}\}) = e(T_1, T_2, F_j^{i'}) - 1$.

2. Since edge $e_u$ is not marked, neither $u$ nor $u'$ is marked, that is, $x$ is not marked in $F_C$
and, hence, $y$ is not labelled "$T_j$" in $F_C$.

Now we examine each of the steps that can cut $e_y$ and prove that these observations lead to a
contradiction.

First assume the $i$th edge $e_y$ we cut belongs to $F_1^{i-1}$. Then $e_y$ is cut by an application of Step 3
in response to cutting an edge $e_b$ or $e_{b_i}$ earlier, or $e_y$ is the parent edge in $F_1^{i-1}$ of a node $y \in \{a, c\}$
whose parent edge in $F_2^{i-2}$ is the $(i-1)$st edge we cut (both edges are cut successively in Step 6).
First assume the former. Then $y$ is a root in $F_2^{i-1}$, which implies that there exists a $k < i$ such that
the $k$th edge we cut is an edge $e_z$ in $F_2^{k-1}$ such that $z$ is an ancestor of $y$ in $F_2^{k-1}$ and $z$ is a node
$b$ or $b_i$ in this application of Step 6. We choose the maximal such $k$. This implies that no edge on
the path from $y$ to $z$ is cut by any subsequent step. Indeed, if we cut such an edge, it would have
to be an edge $e_{z'}$ with $z' \in \{a, c\}$, by the choice of $k$. If $z' = y$, then $e_y$ would be cut in Step 6; if
$z' \ne y$, then $e_y$ would belong to a subtree of $F_1$ whose root is the member of a sibling pair, and $e_y$
would never be cut. In either case, we obtain a contradiction. Now observe that any case of Step 6
that cuts an edge $e_b$ or $e_{b_i}$ labels $b$ or $b_i$ with "$T_1$", that is, $z$ is labelled with "$T_1$" immediately
after cutting $e_z$. Since we have just argued that no edges are cut on the path from $z$ to $y$ and $y$ is
a root in $F_2^{i-1}$, our rules for maintaining labels under forced contractions imply that $y$ inherits $z$'s
"$T_1$"-label, a contradiction.

Now suppose $e_y$ belongs to $F_1^{i-1}$ and is cut in Step 6, that is, $y \in \{a, c\}$. Since Step 6 cuts an edge
in $F_1$ immediately after cutting the corresponding edge in $F_2$, the $(i-1)$st edge we cut is $y$'s parent
edge in $F_2^{i-2}$. If $e_y$ is cut by an application of Case 6.1, assume w.l.o.g. that $y = c$ and, hence, $x = a$.
Since the invocation Maaf$(F_1^{i-2}, F_2^{i-2}, k)$ that cuts $e_y$ is viable and $e(T_1, T_2, F_2^{i-2} \div \{e_x\}) =
e(T_1, T_2, F_2^{i-2}) - 1$, the invocation Maaf$(F_1^{i-2} \div \{e_x\}, F_2^{i-2} \div \{e_x\}, k - 1)$ is also viable. Since we
apply Case 6.1, $x$ and $x'$ are siblings in $F_C$, and $F_C$ is a refinement of $F_2^{i-2}$, we have
$x' \not\sim_{F_2^{i-2}} y$. Since $F_x$ is also a refinement of $F_2^{i-2}$, this implies that $x' \not\sim_{F_x} y$. In particular,
$x'$ and $y$ are not siblings in $F_x$. Since $e_y$ is the only pendant edge of the path from $x$ to $x'$
in $F_1^{i-2}$, this implies that either $y$ is a root in $F_x$ or its parent in $F_x$ is a proper ancestor in
$F_1^{i-2}$ of the common parent of $x$ and $x'$ in $F_y = F_C$. In both cases, $d_1(y) < d_1(x)$, contradicting
that we chose the invocation Maaf$(F_1^{i-2} \div \{e_y\}, F_2^{i-2} \div \{e_y\}, k - 1)$ instead of the invocation
Maaf$(F_1^{i-2} \div \{e_x\}, F_2^{i-2} \div \{e_x\}, k - 1)$ on the path to $F_C$.

If $e_y$ is cut by an application of Case 6.2, $y$ is labelled with "$T_1$" unless $y = a$ and we apply
the third branch of this case, or $y = c$ and we apply the second branch of this case. If $y = a$
and we apply the third branch, then $x = c$ and the $(i+2)$nd edge we cut is edge $e_c$ in $F_1^{i+1}$,
which contradicts that $x = c$ has a sibling in $F_C$. If $y = c$ and we apply the second branch of this
case, then $x = a$. However, since $e(T_1, T_2, F_C \div \{e_x\}) = e(T_1, T_2, F_C) - 1$ and we cut edge $e_y$ to
obtain $F_C$ from $F_2^{i-2}$, we have in fact $e(T_1, T_2, F_2^{i-2} \div \{e_a, e_c\}) = e(T_1, T_2, F_2^{i-2}) - 2$, that is, the
invocation Maaf$(F_1^{i-2} \div \{e_a, e_c\}, F_2^{i-2} \div \{e_a, e_c\}, k - 2)$ is viable. This contradicts that we chose
the invocation Maaf$(F_1^{i-2} \div \{e_c\}, F_2^{i-2} \div \{e_c\}, k - 1)$ as the next invocation on the path to $F_C$.

Finally, suppose $e_y$ is cut by an application of Case 6.3. If $x'$ and $y$ are not siblings in $F_x$,
then the same argument as for Case 6.1 leads to a contradiction to the choice of $F_C$. So assume
that $x'$ and $y$ are siblings in $F_x$, that is, that $d_1(x) = d_1(y)$. Since $e_y$ is the last pendant edge of
the path from $x$ to $x'$ in either of the two forests $F_1$ and $F_2$, $x$ and $x'$ are siblings in $F_2^{i-1}$. This
implies that either $x$ and $x'$ are siblings also in $F_2^{i-2}$ or $e_y$ is the only pendant edge of the path
from $x$ to $x'$ in $F_2^{i-2}$. In the first case, we have $d_2(y) < d_2(x)$, contradicting that we chose the
invocation Maaf$(F_1^{i-2} \div \{e_y\}, F_2^{i-2} \div \{e_y\}, k - 1)$ on the path to $F_C$, even though the invocation
Maaf$(F_1^{i-2} \div \{e_x\}, F_2^{i-2} \div \{e_x\}, k - 1)$ is viable. In the second case, cutting $e_y$ in $F_2^{i-2}$ labels $y$ with
"$T_2$". Since $y$ is the sibling of $x$ or $x'$ in $F_2^{i-2}$, this implies that $x$ or $x'$ is marked in $F_C$, again a
contradiction.

Finally, assume $e_y$ belongs to $F_2^{i-1}$. Then $e_y$ is cut by an application of Case 6.2 or Case 6.3
because Case 6.1 labels the bottom endpoint of each edge it cuts with "$T_2$", contradicting that $y$
is not labelled with "$T_2$".

In Case 6.2, $e_y$ is either $e_b$ or $e_c$ because, when edge $e_a$ is cut, $a$ is labelled with "$T_2$". First
suppose $e_y = e_b$. Since $e_y$ is the last pendant edge of the path from $x$ to $x'$ we cut in either
of the two forests $F_1$ and $F_2$, we have $x = a$ and $x' = c$. However, since the current invocation
Maaf$(F_1^{i-1}, F_2^{i-1}, k)$ is viable and $e(T_1, T_2, F_2^{i-1} \div \{e_c\}) = e(T_1, T_2, F_2^{i-1} \div \{e_{x'}\}) =
e(T_1, T_2, F_2^{i-1}) - 1$, the invocation Maaf$(F_1^{i-1} \div \{e_c\}, F_2^{i-1} \div \{e_c\}, k - 1)$ is also viable, which
contradicts that we chose the invocation Maaf$(F_1^{i-1}, F_2^{i-1} \div \{e_b\}, k - 1)$ as the next invocation
on the path to $F_C$.

If $e_y = e_c$, it must be an application of branch (iii) of Case 6.2 that cuts $e_y$ because branch (ii)
labels $c$ with "$T_2$". In this case, $x = b$ because we cut $e_a$ before $e_c$. Then, however, $b$ is $a$'s sibling
in $F_2$ and the tail of $a$'s $T_2$-hybrid edge. Since $a$ is labelled "$T_2$" in this case, this implies that $x = b$
is marked in $F_C$, a contradiction.

In Case 6.3, we label $a$ or $c$ "$T_2$". So $e_y$ must be one of the pendant edges $e_{b_1}, e_{b_2}, \ldots, e_{b_q}$ of the path from
$a$ to $c$ in $F_2$. Along with the fact that $e_y$ is the last pendant edge of the path from $x$ to $x'$ we
cut, this implies that $x = c$ or $x' = c$. Since the invocation Maaf$(F_1^{i-q}, F_2^{i-q}, k)$ that cuts edges
$e_{b_1}, e_{b_2}, \ldots, e_{b_q}$ is viable and $e(T_1, T_2, F_2^{i-q} \div \{e_c\}) = e(T_1, T_2, F_2^{i-q} \div \{e_{x'}\}) = e(T_1, T_2, F_2^{i-q}) - 1$,
the invocation Maaf$(F_1^{i-q} \div \{e_c\}, F_2^{i-q} \div \{e_c\}, k - 1)$ is also viable, contradicting that we chose
the invocation Maaf$(F_1^{i-q}, F_2^{i-q} \div \{e_{b_1}, e_{b_2}, \ldots, e_{b_q}\}, k - q)$ as the next invocation on the path to
$F_C$. □



5.5 Improved Refinement and Analysis 

The algorithm developed so far considers all subsets of marked potential exit nodes to fix in each
agreement forest $F$ it finds, and we argued that there are at most $2^k$ such subsets to consider, each
taking linear time to check. However, if $k'$ is the number of edges we cut to obtain $F$, there are
in fact only $k'$ marked potential exit nodes and, hence, only $2^{k'}$ subsets of marked potential exit
nodes to consider. When $k'$ is small, the resulting time bound of $O(2^{k'} n)$ for the refinement step
is substantially better than the bound of $O(2^k n)$ obtained using the naive upper bound of $k' \le k$.
For large values of $k'$, we observe that $F$ has $k' + 1$ components because we always cut edges in
a fully contracted forest (i.e., a forest without degree-2 vertices other than its component roots).
When fixing a set of $k''$ potential exit nodes in the refinement step, we cut at least $k''$ edges, and
this increases the number of connected components by at least $k''$, again because we cut edges along
paths in fully contracted forests. Thus, if $k' + k'' > k$, we cannot possibly obtain an AAF with at
most $k + 1$ components: the refinement step applied to $F$ needs to consider only subsets of at most
$k'' := k - k'$ potential exit nodes. Since there are $k'$ marked potential exit nodes to choose from,
this reduces the running time of the refinement step applied to such a forest $F$ to $O\bigl(\sum_{j=0}^{k''} \binom{k'}{j} n\bigr)$.
For large values of $k'$, $k''$ is small and the sum is significantly less than $O(2^{k'} n) = O(2^k n)$. Thus,
we obtain a substantial improvement of the running time of the refinement step also in this case,
without affecting its correctness.
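To make the saving concrete, the following minimal sketch counts the subsets of marked potential exit nodes that the refinement step has to try under both bounds. The function names `subsets_naive` and `subsets_improved` are illustrative and not part of the algorithm's description; the sketch only evaluates the two counting formulas and does not perform any refinement.

```python
from math import comb


def subsets_naive(k_prime: int) -> int:
    """Number of subsets of the k' marked potential exit nodes: 2^{k'}."""
    return 2 ** k_prime


def subsets_improved(k: int, k_prime: int) -> int:
    """Number of subsets of size at most k'' = k - k' among k' nodes."""
    k_dprime = k - k_prime
    if k_dprime < 0:
        return 0  # more than k edges were cut already; nothing to try
    # Subsets cannot be larger than the number of available nodes.
    largest = min(k_dprime, k_prime)
    return sum(comb(k_prime, j) for j in range(largest + 1))


if __name__ == "__main__":
    k = 20
    for k_prime in (5, 10, 15, 18, 20):
        print(k_prime, subsets_naive(k_prime), subsets_improved(k, k_prime))
```

For small $k'$ the two counts coincide, while for $k'$ close to $k$ the improved count collapses to a handful of subsets.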

To analyze the running time of our algorithm using this improved refinement step, observe first
that the running time of the algorithm excluding the refinement steps is $O(2.42^k n)$, as argued in the
proof of Theorem 9. Since this is below the time bound of $O(3.18^k n)$ we want to prove, it suffices
to prove that the running time of the refinement steps is $O(3.18^k n)$. To this end, we split each
refinement step into several refinement steps. A refinement step that tries all subsets of between
$0$ and $k''$ marked potential exit nodes is replaced with $k'' + 1$ refinement steps: for $0 \le j \le k''$,
the $j$th such refinement step tries all subsets of exactly $j$ marked potential exit nodes. Its running
time is therefore $O\bigl(\binom{k'}{j} n\bigr)$, and the total cost of all refinement steps remains unchanged. Now we
partition the refinement steps invoked for the different agreement forests the algorithm finds into
$k + 1$ groups. For $0 \le h \le k$, the $h$th group contains a refinement step applied to an agreement
forest $F$ if the number $k'$ of edges cut to obtain $F$ and the size $j$ of the subsets of marked potential
exit nodes the refinement step tries satisfy $k' + j = h$. We prove that the total running time of all
refinement steps in the $h$th group is $O(3.18^h n)$. Hence, the total running time of all refinement
steps is $O\bigl(\sum_{h=0}^{k} 3.18^h n\bigr) = O(3.18^k n)$, as claimed.
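The last equality is a geometric sum. Spelled out as a short worked bound (the constant 1.46 is only a convenient rounding of $3.18/2.18$):

```latex
\[
\sum_{h=0}^{k} 3.18^{h}\, n
  \;=\; \frac{3.18^{k+1}-1}{3.18-1}\, n
  \;<\; \frac{3.18}{2.18}\cdot 3.18^{k}\, n
  \;<\; 1.46\cdot 3.18^{k}\, n .
\]
```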

Now consider the tree of recursive calls made in the bounded search part of the algorithm. Since
a given invocation Maaf$(F_1, F_2, k'')$ spawns further recursive calls only if $F_2$ is not an agreement
forest of $T_1$ and $T_2$, and we invoke the refinement step on $F_2$ only if $F_2$ is an agreement forest of
$T_1$ and $T_2$, refinement steps are invoked only from the leaves of this recursion tree. Moreover, since
every refinement step in the $h$th group satisfies $k' + j = h$ and, hence, $k' \le h$, refinement steps in the
$h$th group can be invoked only for agreement forests that can be produced by cutting at most $h$ edges
in $T_2$. Thus, to bound the running time of the refinement steps in the $h$th group, we can restrict
our attention to the subtree of the recursion tree containing all recursive calls Maaf$(F_1, F_2, k'')$
such that $F_2$ can be obtained from $T_2$ by cutting at most $h$ edges, that is, $k'' \ge d := k - h$. Since we
want to obtain an upper bound on the cost of the refinement steps in the $h$th group, we can assume
that the shape of this subtree and the set of refinement steps invoked from its leaves are such that
the total cost of the refinement steps is maximized. We construct such a worst-case recursion tree
for the refinement steps in the $h$th group in two steps.

First we construct a recursion tree without refinement steps and such that, for each $d \le k'' \le k$,
the number of invocations with parameter $k''$ in this tree is maximized. As in the proof of Theorem 9,
this is the case if each recursive call with parameter $k'' \ge d + 2$ makes three recursive calls, two
with parameter $k'' - 1$ and one with parameter $k'' - 2$, and each recursive call with parameter
$k'' = d + 1$ makes two recursive calls with parameter $k'' - 1$. As in the proof of Theorem 9, this
implies that every recursive call with parameter $k''$ has a tree of $\Theta\bigl((1 + \sqrt{2})^{k'' - d}\bigr)$ recursive calls
below it, and the size of the entire tree is $O\bigl((1 + \sqrt{2})^{k - d}\bigr) = O\bigl((1 + \sqrt{2})^{h}\bigr)$. The second step is
to choose a subset of recursive calls in this tree for which we invoke the refinement step instead of
spawning further recursive calls, thereby turning them into leaves. In effect, for each such node with
parameter $k''$, we replace its subtree of $\Theta\bigl((1 + \sqrt{2})^{k'' - d}\bigr)$ recursive calls with a single refinement
step of cost $O\bigl(\binom{k'}{j} n\bigr)$, where $k' = k - k'' = h + d - k''$ and $j = h - k' = k'' - d$. By charging the cost
of this refinement step equally to the nodes in the removed subtree, each node in this subtree is
charged a cost of $\Theta\bigl(\binom{k'}{j} n / (1 + \sqrt{2})^{k'' - d}\bigr) = \Theta\bigl(\binom{k'}{j} n / (1 + \sqrt{2})^{j}\bigr)$. The total running time of all
refinement steps in the $h$th group is the sum of the charges of all nodes removed from the recursion
tree. If we choose $k'$ and $j$ so that $k' + j = h$ and $\binom{k'}{j} / (1 + \sqrt{2})^{j}$ is maximized, no removed node
is charged more than $\Theta\bigl(\binom{k'}{j} n / (1 + \sqrt{2})^{j}\bigr)$. Since we can remove at most $O\bigl((1 + \sqrt{2})^{h}\bigr)$ nodes
from the tree, the cost of all refinement steps in the $h$th group is therefore

$$O\left((1 + \sqrt{2})^{h} \cdot \frac{\binom{k'}{j}\, n}{(1 + \sqrt{2})^{j}}\right) = O\left((1 + \sqrt{2})^{k'} \binom{k'}{j}\, n\right). \qquad (2)$$
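As a sanity check on the size of the worst-case recursion tree used above, the following sketch counts its nodes directly from the branching rule (two calls with parameter $k'' - 1$ and one with $k'' - 2$, and only two calls with parameter $k'' - 1$ when $k'' = d + 1$) and compares the count with $(1 + \sqrt{2})^{m}$, where $m = k'' - d$. The function name `tree_size` and the parametrization by $m$ are assumptions made for this illustration only.

```python
from functools import lru_cache
from math import sqrt


@lru_cache(maxsize=None)
def tree_size(m: int) -> int:
    """Nodes in the worst-case recursion tree rooted at a call with k'' - d = m."""
    if m == 0:
        return 1  # parameter d: no further calls inside the restricted subtree
    if m == 1:
        return 1 + 2 * tree_size(0)  # two calls with parameter k'' - 1
    # Three calls: two with parameter k'' - 1 and one with parameter k'' - 2.
    return 1 + 2 * tree_size(m - 1) + tree_size(m - 2)


if __name__ == "__main__":
    base = 1 + sqrt(2)
    for m in range(0, 21, 5):
        print(m, tree_size(m), tree_size(m) / base ** m)
```

The printed ratios stay below a small constant, which is what the $\Theta\bigl((1 + \sqrt{2})^{k'' - d}\bigr)$ bound asserts.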



It remains to bound this expression by $O(3.18^h n)$. First assume that $k' \le 2h/3$. Then we can
bound $\binom{k'}{j}$ by $2^{k'}$, and $(1 + \sqrt{2})^{k'} \cdot \binom{k'}{j}$ by $(2 + 2\sqrt{2})^{k'} \le 4.84^{2h/3} < 2.87^h$, that is, (2) is bounded
by $O(2.87^h n)$. For $k' = h$, we have $j = 0$ and, hence, (2) is bounded by $O(2.42^h n)$ in this case.
To bound (2) for $2h/3 < k' < h$, we make use of the following lemma.



Lemma 13. $\binom{x}{y} = O\left(\left(\frac{x}{y}\right)^{y} \left(\frac{x}{x - y}\right)^{x - y}\right)$.



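Proof. Stirling's formula states that

$$\lim_{x \to \infty} \frac{x!}{\sqrt{2 \pi x}\,(x/e)^{x}} = 1.$$

In other words, $x! = \Theta\bigl(\sqrt{x}\,(x/e)^{x}\bigr)$. Substituting this into the expansion of $\binom{x}{y}$ gives

$$\binom{x}{y} = \frac{x!}{y!\,(x - y)!} = O\left(\frac{\sqrt{x}\,(x/e)^{x}}{\sqrt{y}\,(y/e)^{y}\,\sqrt{x - y}\,((x - y)/e)^{x - y}}\right).$$

Since either $y \ge x/2$ or $x - y \ge x/2$, we have $\frac{\sqrt{x}}{\sqrt{y}\,\sqrt{x - y}} \le \sqrt{2}$. Thus,

$$\binom{x}{y} = O\left(\frac{(x/e)^{x}}{(y/e)^{y}\,((x - y)/e)^{x - y}}\right) = O\left(\left(\frac{x}{y}\right)^{y} \left(\frac{x}{x - y}\right)^{x - y}\right). \qquad \Box$$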
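The bound of Lemma 13 can be checked numerically for small arguments. The sketch below is an illustration only; the helper name `lemma13_bound` is hypothetical. For $0 < y < x$ the inequality in fact holds without a hidden constant, because $\binom{x}{y} p^{y} (1 - p)^{x - y} \le 1$ for $p = y/x$ by the binomial theorem.

```python
from math import comb


def lemma13_bound(x: int, y: int) -> float:
    """Right-hand side of Lemma 13, (x/y)^y * (x/(x-y))^(x-y), for 0 < y < x."""
    return (x / y) ** y * (x / (x - y)) ** (x - y)


def worst_ratio(max_x: int) -> float:
    """Largest value of binom(x, y) / bound over all 0 < y < x <= max_x."""
    return max(
        comb(x, y) / lemma13_bound(x, y)
        for x in range(2, max_x + 1)
        for y in range(1, x)
    )


if __name__ == "__main__":
    print(worst_ratio(60))  # stays at or below 1
```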
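Lemma 13 allows us to bound (2) by

$$O\left((1 + \sqrt{2})^{k'} \left(\frac{k'}{j}\right)^{j} \left(\frac{k'}{k' - j}\right)^{k' - j} n\right)
= O\left((1 + \sqrt{2})^{\alpha h} \left(\frac{\alpha h}{(1 - \alpha) h}\right)^{(1 - \alpha) h} \left(\frac{\alpha h}{(2\alpha - 1) h}\right)^{(2\alpha - 1) h} n\right)$$

$$= O\left(\left((1 + \sqrt{2})^{\alpha} \left(\frac{\alpha}{1 - \alpha}\right)^{1 - \alpha} \left(\frac{\alpha}{2\alpha - 1}\right)^{2\alpha - 1}\right)^{h} n\right),$$

where $k' = \alpha h$ and, hence, $j = (1 - \alpha) h$. It remains to determine the value of $\alpha$ such that
$2/3 < \alpha < 1$ and the function

$$b(\alpha) = (1 + \sqrt{2})^{\alpha} \left(\frac{\alpha}{1 - \alpha}\right)^{1 - \alpha} \left(\frac{\alpha}{2\alpha - 1}\right)^{2\alpha - 1}$$

is maximized. Taking the derivative and setting it to zero, we obtain that $b(\alpha)$ is maximized for
$\alpha = \frac{1}{2} + \frac{1}{2}\sqrt{\frac{1 + \sqrt{2}}{5 + \sqrt{2}}} \approx 0.807$, which gives $b(\alpha) \le 3.18$. This finishes the proof that the total cost of the
refinement steps in the $h$th group is $O(3.18^h n)$, which, as we argued already, implies that the
running time of the entire algorithm is $O(3.18^k n)$. Thus, we have the following theorem.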
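The maximization of $b(\alpha)$ can be reproduced numerically. The following sketch evaluates $b$ on a fine grid over $(2/3, 1)$ and compares the grid maximizer with the stationary point obtained from $\frac{d}{d\alpha} \ln b(\alpha) = 0$, which simplifies to $(1 + \sqrt{2})\,\alpha (1 - \alpha) = (2\alpha - 1)^{2}$. The function names and the grid resolution are choices made for this illustration.

```python
from math import exp, log, sqrt

C = 1 + sqrt(2)  # growth rate of the bounded search tree, about 2.414


def b(alpha: float) -> float:
    """b(alpha) for 2/3 < alpha < 1, evaluated via logarithms."""
    log_b = (
        alpha * log(C)
        + (1 - alpha) * log(alpha / (1 - alpha))
        + (2 * alpha - 1) * log(alpha / (2 * alpha - 1))
    )
    return exp(log_b)


def grid_argmax(lo: float = 2 / 3, hi: float = 1.0, steps: int = 100_000) -> float:
    """Grid point strictly inside (lo, hi) at which b is largest."""
    candidates = (lo + (hi - lo) * i / steps for i in range(1, steps))
    return max(candidates, key=b)


def stationary_point() -> float:
    """Solution of C * alpha * (1 - alpha) = (2 * alpha - 1)^2 in (1/2, 1)."""
    return 0.5 + 0.5 * sqrt(C / (C + 4))


if __name__ == "__main__":
    a = grid_argmax()
    print(a, stationary_point(), b(a))
```

Both approaches agree on $\alpha \approx 0.807$, and the maximum of $b$ lies just below 3.18.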

Theorem 10. For two rooted $X$-trees $T_1$ and $T_2$ and a parameter $k$, it takes $O(3.18^k n)$ time to
decide whether $e(T_1, T_2, T_2) \le k$.
As with the MAF algorithms, we can use known kernelizations [9] to transform the trees $T_1$
and $T_2$ into two trees $T_1'$ and $T_2'$ of size $O(e(T_1, T_2, T_2))$. However, unlike the kernelizations used
for SPR and TBR distance, these kernelization rules produce trees that do not have the same
hybridization distance as $T_1$ and $T_2$. One of these rules, the Chain Reduction, replaces a chain of
leaves $a_1, a_2, \ldots$ with a pair of leaves $a, b$. Bordewich and Semple [9] showed that in an MAAF of
the resulting two trees, either $a$ and $b$ are both isolated or neither is. A corresponding MAAF of
$T_1$ and $T_2$ can be obtained by cutting the parent edges of $a_1, a_2, \ldots$ in the first case or replacing $a$
and $b$ with the sequence of leaves $a_1, a_2, \ldots$ in the second case. The difference in size between these
two MAAFs is captured by assigning the number of leaves removed by the reduction as a weight
to the pair $(a, b)$. The weight of an AAF of the two reduced trees $T_1'$ and $T_2'$ then is the number of
components of the AAF plus the weights of all such pairs $(a, b)$ such that $a$ and $b$ are isolated in
the AAF. This weight equals the size of the corresponding AAF of $T_1$ and $T_2$.

It is not difficult to adapt these weights to our fixed-parameter algorithm. Whenever the
refinement algorithm would return "yes", we first add the sum of the weights of isolated pairs
to the number of components in the found acyclic agreement forest. If, and only if, this total
is less than or equal to $k_0$, we return "yes". Any agreement forest $F$ of $T_1'$ and $T_2'$ with weight
$w(F) = e(T_1, T_2, T_2)$ has at most $w(F)$ components and so will be examined by this strategy.
Similarly, the depth of the recursion is bounded by the number of components and, thus, by $k_0$.
Thus, we obtain the following corollary.

Corollary 5. For two rooted $X$-trees $T_1$ and $T_2$ and a parameter $k$, it takes $O(3.18^k k + n^3)$ time
to decide whether $e(T_1, T_2, T_2) \le k$.
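The weighted acceptance test described before Corollary 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: an AAF of the reduced trees is represented as a list of leaf-label sets, and `ReducedPair`, `weighted_size`, and `accept` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class ReducedPair:
    """A pair (a, b) created by a chain reduction (hypothetical representation)."""
    a: str
    b: str
    weight: int  # number of leaves removed by the reduction


def weighted_size(components: List[FrozenSet[str]],
                  pairs: Iterable[ReducedPair]) -> int:
    """Number of components plus the weights of pairs whose leaves are both isolated."""
    isolated = {next(iter(c)) for c in components if len(c) == 1}
    total = len(components)
    for pair in pairs:
        if pair.a in isolated and pair.b in isolated:
            total += pair.weight
    return total


def accept(components: List[FrozenSet[str]],
           pairs: Iterable[ReducedPair], k0: int) -> bool:
    """Answer "yes" if and only if the weighted total is at most k0."""
    return weighted_size(components, pairs) <= k0


if __name__ == "__main__":
    forest = [frozenset({"a"}), frozenset({"b"}), frozenset({"c", "d"})]
    print(weighted_size(forest, [ReducedPair("a", "b", 3)]))
```

Because the weights are only consulted when a candidate forest is about to be accepted, the search itself is unchanged.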

6 Conclusions 

The algorithms presented in this paper are the theoretically fastest algorithms for computing SPR 
distances and hybridization numbers of rooted phylogenies and for computing TBR distances of 
unrooted phylogenies known to date. While the same ideas we used to obtain FPT algorithms 
for these problems also lead to the best known approximation algorithms for SPR distance and 
TBR distance, the global nature of the acyclicity constraint for hybridization numbers forced us to 
take a less localized approach to cycle breaking in our hybridization algorithm, which remains the 
key obstacle in obtaining an approximation algorithm also for hybridization numbers. A similar 
observation was made also by Bordewich et al. [9]. 

Another important open problem is extending our approach to computing maximum agreement 
forests and maximum acyclic agreement forests for multifurcating trees and for more than two trees. 
Evolutionary biologists often construct phylogenetic trees using methods that assign a measure 
of statistical support to each edge of the tree. Contracting edges with poor statistical support 
eliminates bipartitions that may be artifacts of the manner in which the tree was constructed but 
the resulting trees are multifurcating trees. If we can extend our methods to support multifurcating 
trees, the comparisons of binary phylogenies our new algorithms make possible can be applied also 
to multifurcating trees. The kernelization results of Linz and Semple [19] for maximum acyclic 
agreement forests apply to such trees. Extending our bounded search tree approach to computing 
agreement forests to multifurcating trees is currently the focus of ongoing efforts on our part. A 
first step of comparing multiple phylogenies over a set of species could be to identify groups of 
species for which all trees tell the same evolutionary "story", which is exactly what a maximum 
agreement forest of all the trees in the given set would represent. The only previous result in this 
direction is the 8-approximation algorithm of Chataigner [11] for computing an MAF of two or
more unrooted phylogenies. We are optimistic that the ideas presented in this paper can be used
as a basis for obtaining an exact solution to this problem and extending the results also to rooted
phylogenies.

As we discussed on page 29, the simple approach of identifying cycles in the expanded cycle
graph $G^*_F$ and branching on cutting the parent edges of the exit nodes in this cycle results in an FPT
algorithm with running time $O(x^k n)$ for hybridization, where $x$ is the length of the longest cycle in
$G^*_F$. If, instead of branching, we cut the parent edges of all exit nodes in the cycle simultaneously, we
obtain an $x$-approximation algorithm. Since $x$ can be as large as $k$, these results are theoretically
not appealing. In practice, however, such an approach may work well, particularly if using the
approximation algorithm in a branch-and-bound approach to finding the exact answer, as we did
for rooted MAF in [28].

While the theoretical results presented in this paper are interesting in their own right, as 
they shed further light on the complexity of computing agreement forests, experimental results 
indicate that our algorithms also perform very well in practice. In [28] we evaluated the practical 
performance of our algorithms for rooted SPR distance and demonstrated that they are an order 
of magnitude faster than the currently best exact alternatives [HE] based on reductions to integer 
linear programming and satisfiability testing, respectively. The implementation and its source 
code are publicly available [30]. The largest distances reported using implementations of previous 
methods are a hybridization number of 14 on 40 taxa [6] and an SPR distance of 19 on 46 taxa [31]. 
In contrast, our method took less than 5 hours to compute SPR distances of up to 46 on trees 
with 144 taxa and 99 on synthetic 1000-leaf trees and required less than one second on average to 
compute SPR distances of up to 19 on 144 taxa. This represents a major step forward towards tools 
that can infer reticulation scenarios for the thousands of genomes that have been fully sequenced 
to date. 